Evaluating Text Extraction: Apache Tika's New Tika-Eval Module
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore the new tika-eval module for evaluating text extraction tools in this 44-minute conference talk by Tim Allison from The MITRE Corporation. Learn why text extraction matters for applications such as search and natural language processing. Discover how Apache Tika™ detects file types and extracts metadata and text from numerous file formats. Gain insights into the evaluation methodology for content extraction systems, including metrics, limitations, and real-world results from testing on public domain documents. Understand common challenges in text extraction, such as hidden problems and missing text. Delve into topics like regression testing, evaluation metrics, and the importance of human interpretation in the evaluation process. The speaker draws on extensive experience in natural language processing and content extraction, and shares resources and conclusions about Tika, a crucial component in many popular tools such as Solr™, Nutch™, and Elasticsearch.
Syllabus
Introduction
Overview
What's different
Content Extraction
Metadata
Blood on the Highway
Search
Regression Testing
What Can Go Wrong
Hidden Problems
Example of Missing Text
Dream
Evaluation Metric
TikaEval Overview
TikaEval Definitions
Why TikaEval
TikaEval
Profile
Compare
StartDB
Profile Reports
Common Words Metric
Similarity Metric
Common Word Metric
Evaluation Metric Public
Limitations
Human Interpretation
Conclusion
Resources
Thank you
Data import handler
Metadata normalization
Application dependent
Taught by
Linux Foundation
Related Courses
Software Testing (NPTEL via Swayam)
Specialize in QA Manual Testing with Live Project+AGILE+JIRA (Udemy)
Software Testing Foundations: Bug Writing and Management (LinkedIn Learning)
Software Testing Foundations: Test Management (LinkedIn Learning)
Software Testing (NPTEL via YouTube)