The Key to Cost-Efficient Quality Text Annotation - Data Pre-Processing

Offered By: BasisTech via YouTube

Course Description

Overview

Discover effective strategies for optimizing text annotation processes in this conference talk from HLTCon 2021. Learn how to significantly reduce redundant manual work and streamline the labor-intensive, expensive process of data annotation through proper pre-processing techniques. Explore methods for deduplicating data sets, minimizing annotator effort, and preventing data leakage across training and test sets. Delve into textual deduplication using Locality Sensitive Hashing and gain insights into text normalization for various languages, including Chinese, Japanese, Korean, Arabic, and European languages with accents. Understand the complexities of Unicode representations and their impact on machine learning models. Acquire valuable knowledge from an experienced professional in managing training data processes to enhance the efficiency and quality of your text annotation projects.
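As a rough illustration of the textual deduplication idea mentioned above, here is a minimal sketch of a 32-bit Simhash fingerprint in Python. It is not the code from the talk: the character trigram features, the use of MD5 as the per-feature hash, and the function names `simhash32` and `hamming` are all assumptions made for this example. Near-duplicate texts tend to produce fingerprints that differ in only a few bits, so candidates can be compared by Hamming distance instead of full pairwise text comparison.

```python
import hashlib


def simhash32(text, ngram=3):
    """Return a 32-bit Simhash fingerprint of `text`.

    Features are character n-grams (an assumption for this sketch);
    each feature votes +1 or -1 on every bit position.
    """
    counts = [0] * 32
    # Always produce at least one feature, even for very short strings.
    grams = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    for gram in grams:
        # Take the first 4 bytes of MD5 as a stable 32-bit feature hash.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")
        for bit in range(32):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(32):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash32("The quick brown fox jumps over the lazy dog")
    b = simhash32("The quick brown fox jumped over the lazy dog")
    print(f"{a:032b}\n{b:032b}\ndistance: {hamming(a, b)}")
```

A full LSH pipeline would additionally bucket fingerprints (for example by sorting bit-rotated copies) so that only fingerprints sharing a prefix are compared; that step is omitted here.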

Syllabus

Intro
Blueprint for Supervised Machine Learning
Goal: Maximize Manual Annotation Efficiency - 1. Deduplicate: minimize manual effort by finding unique subjects in our data sets, so that humans annotate each subject only once and duplicate data does not leak across training and test sets
Part 1 - Textual Deduplication: Measuring Similarity How can we find
Time Complexity of Pairwise Comparisons
Textual Deduplication: LSH Bitwise Rotations
Locality Sensitive Hashing: 32 Bit Simhash
Part 2 - Text Normalization Machine Representations
Text Normalization: Unicode Examples What's the difference?
Text Normalization: Halfwidth & Fullwidth Katakana
Text Normalization: Katakana Code Block
Text Normalization: Halfwidth & Fullwidth Forms
Text Normalization: Hebrew Presentation Forms
Text Normalization: Unicode Normalization Forms
Text Normalization: Composing Marks Normalization
Text Normalization: Katakana Normalization
Text Normalization: Hebrew Normalization
Additional Normalization Resources
Conclusion
Attributions: The Noun Project
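
The text normalization topics in the syllabus can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not material from the talk: the helper name `normalize_text`, the choice of NFKC as the default form, and the sample strings are assumptions made for this example. It shows how visually similar strings with different code points become identical after normalization.

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Apply a Unicode normalization form (NFC, NFD, NFKC, or NFKD)."""
    return unicodedata.normalize(form, text)


# Halfwidth katakana KA (U+FF76) plus halfwidth voiced sound mark (U+FF9E)
# versus the single fullwidth katakana GA (U+30AC).
halfwidth_ga = "\uff76\uff9e"
fullwidth_ga = "\u30ac"

# Latin "e" followed by a combining acute accent (U+0301)
# versus the precomposed character U+00E9.
decomposed_e = "e\u0301"
precomposed_e = "\u00e9"

# Hebrew presentation form SHIN WITH SHIN DOT (U+FB2A)
# versus the base letter SHIN (U+05E9) followed by the point SHIN DOT (U+05C1).
presentation_shin = "\ufb2a"
plain_shin = "\u05e9\u05c1"

if __name__ == "__main__":
    pairs = [
        (halfwidth_ga, fullwidth_ga),
        (decomposed_e, precomposed_e),
        (presentation_shin, plain_shin),
    ]
    for left, right in pairs:
        same_raw = left == right
        same_norm = normalize_text(left) == normalize_text(right)
        print(f"raw equal: {same_raw}, normalized equal: {same_norm}")
```

Normalizing before deduplication and annotation means such variants are treated as the same text rather than as distinct items.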

Taught by

BasisTech
