Evaluating Text Extraction: Apache Tika's New Tika-Eval Module

Offered By: Linux Foundation via YouTube

Course Description

Overview

Explore the new tika-eval module for evaluating text extraction tools in this 44-minute conference talk by Tim Allison of The MITRE Corporation. Learn why text extraction matters for applications such as search and natural language processing, and how Apache Tika™ detects file types and extracts metadata and text from numerous file formats. The talk covers the evaluation methodology for content extraction systems, including metrics, limitations, and real-world results from testing on public domain documents. It also addresses common challenges in text extraction, such as hidden problems and missing text, along with regression testing and the role of human interpretation in the evaluation process. Drawing on his extensive experience in natural language processing and content extraction, the speaker shares resources and conclusions about this crucial component of many popular tools such as Solr™, Nutch™, and Elasticsearch.

Syllabus

Introduction
Overview
What's different
Content Extraction
Metadata
Blood on the Highway
Search
Regression Testing
What Can Go Wrong
Hidden Problems
Example of Missing Text
Dream
Evaluation Metric
TikaEval Overview
TikaEval Definitions
Why TikaEval
TikaEval
Profile
Compare
StartDB
Profile Reports
Common Words Metric
Similarity Metric
Common Word Metric
Evaluation Metric Public
Limitations
Human Interpretation
Conclusion
Resources
Thank you
Data import handler
Metadata normalization
Application dependent

Taught by

Linux Foundation
