The Key to Cost-Efficient Quality Text Annotation - Data Pre-Processing
Offered By: BasisTech via YouTube
Course Description
Overview
Syllabus
Intro
Blueprint for Supervised Machine Learning
Goal: Maximize Manual Annotation Efficiency. 1. Deduplicate: minimize manual effort. Find unique subjects in our data sets so that humans annotate each subject only once, and so that duplicate data does not leak across training and test sets.
Part 1 - Textual Deduplication: Measuring Similarity How can we find
Time Complexity of Pairwise Comparisons
Textual Deduplication: LSH Bitwise Rotations
Locality Sensitive Hashing: 32 Bit Simhash
Part 2 - Text Normalization: Machine Representations
Text Normalization: Unicode Examples (What's the difference?)
Text Normalization: Halfwidth & Fullwidth Katakana
Text Normalization: Katakana Code Block
Text Normalization: Halfwidth & Fullwidth Forms
Text Normalization: Hebrew Presentation Forms
Text Normalization: Unicode Normalization Forms
Text Normalization: Composing Marks Normalization
Text Normalization: Katakana Normalization
Text Normalization: Hebrew Normalization
Additional Normalization Resources
Conclusion
Attributions: The Noun Project
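The normalization chapters listed above (composing marks, halfwidth katakana, Hebrew presentation forms) can be illustrated with a minimal sketch using Python's standard `unicodedata` module. The listing does not say which tooling the talk itself uses, so the choice of Python and the helper name `normalize_text` are assumptions made purely for illustration.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply a Unicode normalization form to text."""
    return unicodedata.normalize(form, text)


# Composing marks: "e" followed by a combining acute accent (U+0301)
# is composed into the single code point U+00E9 under NFC.
composed = normalize_text("e\u0301", "NFC")

# Halfwidth katakana KA (U+FF76) plus halfwidth voiced sound mark (U+FF9E)
# is folded to the fullwidth katakana GA (U+30AC) under NFKC.
katakana = normalize_text("\uFF76\uFF9E", "NFKC")

# NFC applies only canonical mappings, so the halfwidth form is left as-is.
katakana_nfc = normalize_text("\uFF76", "NFC")

# Hebrew presentation form SHIN WITH SHIN DOT (U+FB2A) is replaced by the
# base letter SHIN (U+05E9) followed by the point SHIN DOT (U+05C1).
hebrew = normalize_text("\uFB2A", "NFC")

for label, value in [("composed", composed), ("katakana", katakana),
                     ("katakana_nfc", katakana_nfc), ("hebrew", hebrew)]:
    # Print code points so visually identical strings can be told apart.
    print(label, [f"U+{ord(ch):04X}" for ch in value])
```

Printing code points rather than the strings themselves makes the point of the "What's the difference?" chapter concrete: text that looks identical on screen can differ at the code point level, which matters when normalizing before deduplication or annotation.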
Taught by
BasisTech
Related Courses
Introduction to Internationalization and Localization
University of Washington via edX
Introduction Pratique à YAML
Coursera Project Network via Coursera
encoding and decoding in python
Udemy
Field Guide to Binary
Pluralsight
Design the Web: Creating and Protecting Email Links
LinkedIn Learning