The Key to Cost-Efficient Quality Text Annotation - Data Pre-Processing
Offered By: BasisTech via YouTube
Course Description
Overview
Syllabus
Intro
Blueprint for Supervised Machine Learning
Goal: Maximize Manual Annotation Efficiency. 1. Deduplicate: minimize manual effort by finding the unique subjects in our data sets, so that humans annotate each subject only once and duplicate data does not leak across the training and test sets
Part 1 - Textual Deduplication: Measuring Similarity. How can we find
Time Complexity of Pairwise Comparisons
Textual Deduplication: LSH Bitwise Rotations
Locality Sensitive Hashing: 32 Bit Simhash
Part 2 - Text Normalization: Machine Representations
Text Normalization: Unicode Examples. What's the difference?
Text Normalization: Halfwidth & Fullwidth Katakana
Text Normalization: Katakana Code Block
Text Normalization: Halfwidth & Fullwidth Forms
Text Normalization: Hebrew Presentation Forms
Text Normalization: Unicode Normalization Forms
Text Normalization: Composing Marks Normalization
Text Normalization: Katakana Normalization
Text Normalization: Hebrew Normalization
Additional Normalization Resources
Conclusion
Attributions: The Noun Project
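The normalization topics in the syllabus can be illustrated with a minimal Python sketch. It is not taken from the talk: the helper name `normalize_text` and the sample characters are assumptions chosen for illustration, and only the standard-library `unicodedata` module is used. It shows a halfwidth katakana sequence, a base letter plus combining mark, and two Hebrew presentation forms each being folded to a single canonical spelling, so that visually identical strings compare equal before deduplication or annotation.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Return `text` in the requested Unicode normalization form.

    Hypothetical helper for illustration; NFKC is assumed here because it
    folds compatibility characters such as halfwidth katakana.
    """
    return unicodedata.normalize(form, text)


# Halfwidth KA (U+FF76) followed by the halfwidth voiced sound mark (U+FF9E)
# becomes the single fullwidth letter GA (U+30AC) under NFKC.
halfwidth_ga = "\uff76\uff9e"
fullwidth_ga = "\u30ac"

# "e" followed by a combining acute accent (U+0301) composes to U+00E9 under NFC.
decomposed_e_acute = "e\u0301"
composed_e_acute = "\u00e9"

# Hebrew presentation form SHIN WITH SHIN DOT (U+FB2A) decomposes to
# SHIN (U+05E9) plus the SHIN DOT point (U+05C1).
presentation_shin = "\ufb2a"
plain_shin_with_dot = "\u05e9\u05c1"

# Hebrew ALTERNATIVE AYIN (U+FB20) is a compatibility variant of AYIN (U+05E2).
alternative_ayin = "\ufb20"
plain_ayin = "\u05e2"

if __name__ == "__main__":
    print(normalize_text(halfwidth_ga) == fullwidth_ga)
    print(normalize_text(decomposed_e_acute, "NFC") == composed_e_acute)
    print(normalize_text(presentation_shin) == plain_shin_with_dot)
    print(normalize_text(alternative_ayin) == plain_ayin)
```

Without this step, the raw strings in each pair compare as unequal even though they render the same, which is the kind of mismatch that lets duplicates slip past an exact-match check.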
Taught by
BasisTech
Related Courses
Machine Learning
University of Washington via Coursera
Machine Learning
Stanford University via Coursera
Machine Learning
Georgia Institute of Technology via Udacity
Statistical Learning with R
Stanford University via edX
Machine Learning 1—Supervised Learning
Brown University via Udacity