Building Transformer Tokenizers - Dhivehi NLP #1

Offered By: James Briggs via YouTube

Tags

Natural Language Processing (NLP) Courses
Transformer Models Courses
Low-Resource Languages Courses

Course Description

Overview

Learn how to build a WordPiece tokenizer for Dhivehi, a low-resource language written in the Thaana script. The tutorial covers the challenges of applying NLP to Dhivehi and demonstrates the creation of a custom tokenizer step by step. It walks through the components of the tokenizer pipeline that surround the WordPiece model: normalization, pre-tokenization, post-tokenization (post-processing), and decoding. You then implement and train the tokenizer, save it for use with Transformers, and test it. By the end, you should understand how to develop a tokenizer for a language with little available data and be able to apply the same techniques to other low-resource languages.
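
To make the pipeline stages concrete, here is a minimal pure-Python sketch of how normalization, pre-tokenization, WordPiece matching, post-processing, and decoding fit together. It is not the code from the video. The toy vocabulary, the function names, the choice of NFKC normalization, and the whitespace pre-tokenizer are all assumptions made for illustration; a real vocabulary would be learned from the Dhivehi corpus.

```python
import unicodedata

# Hypothetical toy vocabulary; a real one is learned during tokenizer training.
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "token", "##izer", "##s", "low", "resource"}


def normalize(text):
    """Normalizer: apply Unicode NFKC (an assumed choice) and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def pre_tokenize(text):
    """Pre-tokenizer: split the normalized text into words on whitespace."""
    return text.split()


def wordpiece(word, vocab=VOCAB):
    """Model: greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def post_process(pieces):
    """Post-processor: wrap the sequence in the special tokens a BERT-style model expects."""
    return ["[CLS]"] + pieces + ["[SEP]"]


def decode(pieces):
    """Decoder: drop special tokens and merge '##' pieces back into words."""
    words = []
    for piece in pieces:
        if piece in ("[CLS]", "[SEP]"):
            continue
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


def encode(text):
    """Run the full pipeline: normalize, pre-tokenize, WordPiece, post-process."""
    pieces = []
    for word in pre_tokenize(normalize(text)):
        pieces.extend(wordpiece(word))
    return post_process(pieces)
```

Calling `encode("low  resource tokenizers")` yields the pieces `[CLS] low resource token ##izer ##s [SEP]`, and `decode` maps them back to `low resource tokenizers`. Whitespace splitting is used here because Thaana vowel signs are combining marks, which a naive word-character pattern may treat as word boundaries.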

Syllabus

Intro
Dhivehi Project
Hurdles for Low Resource Domains
Dhivehi Dataset
Download Dhivehi Corpus
Tokenizer Components
Normalizer Component
Pre-tokenization Component
Post-tokenization Component
Decoder Component
Tokenizer Implementation
Tokenizer Training
Post-processing Implementation
Decoder Implementation
Saving for Transformers
Tokenizer Test and Usage
Download Dhivehi Models
First Steps


Taught by

James Briggs

Related Courses

Low Resource Machine Translation
Alfredo Canziani via YouTube
CMU Multilingual NLP - The LORELEI Project
Graham Neubig via YouTube
CMU Multilingual NLP - Information Extraction
Graham Neubig via YouTube
CMU Multilingual NLP 2020 - Text to Speech
Graham Neubig via YouTube
CMU Multilingual NLP 2020 - Low Resource ASR
Graham Neubig via YouTube