The Key to Cost-Efficient Quality Text Annotation - Data Pre-Processing
Offered By: BasisTech via YouTube
Course Description
Overview
Syllabus
Intro
Blueprint for Supervised Machine Learning
Goal: Maximize Manual Annotation Efficiency: 1. Deduplicate. Minimize manual effort: find unique subjects in our data sets so that humans annotate each subject only once, and prevent leaking duplicate data across training and test sets
Part 1 - Textual Deduplication: Measuring Similarity How can we find
Time Complexity of Pairwise Comparisons
Textual Deduplication: LSH Bitwise Rotations
Locality Sensitive Hashing: 32 Bit Simhash
Part 2 - Text Normalization: Machine Representations
Text Normalization: Unicode Examples What's the difference?
Text Normalization: Halfwidth & Fullwidth Katakana
Text Normalization: Katakana Code Block
Text Normalization: Halfwidth & Fullwidth Forms
Text Normalization: Hebrew Presentation Forms
Text Normalization: Unicode Normalization Forms
Text Normalization: Composing Marks Normalization
Text Normalization: Katakana Normalization
Text Normalization: Hebrew Normalization
Additional Normalization Resources
Conclusion
Attributions: The Noun Project
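The two techniques named in the syllabus, Unicode normalization and 32-bit SimHash, can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names (normalize, simhash32, hamming_distance), the whitespace tokenization, the choice of NFKC, and the use of SHA-256 as the per-token hash are all assumptions made for illustration. It also compares fingerprints directly by Hamming distance rather than using the bitwise-rotation lookup the syllabus mentions.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization (an assumed choice of form).

    NFKC folds compatibility characters such as halfwidth katakana into
    their standard forms and composes base letters with combining marks.
    """
    return unicodedata.normalize("NFKC", text)


def simhash32(text: str) -> int:
    """Compute a 32-bit SimHash fingerprint over whitespace-separated tokens."""
    counts = [0] * 32
    for token in text.split():
        # Hypothetical token hash: the first 4 bytes of SHA-256.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:4], "big")
        for bit in range(32):
            # Each token votes +1 or -1 on every bit position.
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(32):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    halfwidth = "ｶﾀｶﾅ annotation sample"
    fullwidth = "カタカナ annotation sample"
    # Without normalization the two strings differ at the code point level.
    print(halfwidth == fullwidth)
    # After normalization they are identical, so their fingerprints match.
    a = simhash32(normalize(halfwidth))
    b = simhash32(normalize(fullwidth))
    print(hamming_distance(a, b))
```

Normalizing before hashing means visually equivalent strings produce the same tokens and therefore the same fingerprint. Documents whose fingerprints differ in only a few bits are candidates for near-duplicate review; the distance threshold is a tuning choice and is not specified here.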
Taught by
BasisTech
Related Courses
Introduction to Artificial Intelligence - Stanford University via Udacity
Natural Language Processing - Columbia University via Coursera
Probabilistic Graphical Models 1: Representation - Stanford University via Coursera
Computer Vision: The Fundamentals - University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course) - California Institute of Technology via Independent