
Building Transformer Tokenizers - Dhivehi NLP #1

Offered By: James Briggs via YouTube

Tags

Natural Language Processing (NLP) Courses
Transformer Models Courses
Low-Resource Languages Courses

Course Description

Overview

Learn how to build an effective WordPiece tokenizer for Dhivehi, a low-resource language written right to left in the Thaana script. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step walkthrough of building a custom tokenizer. Discover the key components of tokenizer design: normalization, pre-tokenization, post-processing, and decoding. Implement and train the tokenizer, test its functionality, and gain insights into working with low-resource languages in NLP. By the end of the tutorial, you'll have a solid understanding of tokenizer development for unfamiliar linguistic contexts and be able to apply the same techniques to other low-resource languages.
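
As a rough sketch of the pipeline described above (not the course's exact code), the following builds and trains a WordPiece tokenizer with the Hugging Face tokenizers library; the corpus file dv_corpus.txt, the vocabulary size, and the special-token set are placeholder assumptions.

# Minimal sketch: a WordPiece tokenizer with explicit normalizer,
# pre-tokenizer, post-processor, and decoder components.
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, trainers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode NFKC; Thaana has no letter case, so no lowercasing step.
tokenizer.normalizer = normalizers.NFKC()

# Pre-tokenizer: split on whitespace and punctuation before WordPiece runs.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Whitespace(), pre_tokenizers.Punctuation()]
)

# Train on a plain-text Dhivehi corpus, one document per line (assumed path).
trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # assumed value, tune to the corpus
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["dv_corpus.txt"], trainer=trainer)

# Post-processing: wrap sequences in [CLS] ... [SEP], BERT-style.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

# Decoder: merge "##" continuation pieces back into whole words.
tokenizer.decoder = decoders.WordPiece(prefix="##")

tokenizer.save("dhivehi-wordpiece.json")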

Syllabus

Intro
Dhivehi Project
Hurdles for Low Resource Domains
Dhivehi Dataset
Download Dhivehi Corpus
Tokenizer Components
Normalizer Component
Pre-tokenization Component
Post-tokenization Component
Decoder Component
Tokenizer Implementation
Tokenizer Training
Post-processing Implementation
Decoder Implementation
Saving for Transformers
Tokenizer Test and Usage
Download Dhivehi Models
First Steps
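
The final syllabus steps, saving the trained tokenizer for Transformers and then testing it, correspond roughly to the sketch below; the file and directory names are placeholders carried over from the sketch above, not paths used in the course.

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizer file so it behaves like any Transformers tokenizer.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="dhivehi-wordpiece.json",  # produced by the training sketch
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("dhivehi-tokenizer")  # reload later via from_pretrained

# Quick round-trip test on a Thaana-script word ("Dhivehi").
encoded = fast_tokenizer("ދިވެހި")
print(encoded["input_ids"])
print(fast_tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))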


Taught by

James Briggs

Related Courses

Natural Language Processing
Columbia University via Coursera
Natural Language Processing
Stanford University via Coursera
Introduction to Natural Language Processing
University of Michigan via Coursera
moocTLH: Nuevos retos en las tecnologías del lenguaje humano
Universidad de Alicante via Miríadax
Natural Language Processing
Indian Institute of Technology, Kharagpur via Swayam