YoVDO

Building Transformer Tokenizers - Dhivehi NLP #1

Offered By: James Briggs via YouTube

Tags

Natural Language Processing (NLP) Courses, Transformer Models Courses, Low-Resource Languages Courses

Course Description

Overview

Learn how to build an effective WordPiece tokenizer for Dhivehi, a low-resource language written in the right-to-left Thaana script. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step demonstration of building a custom tokenizer. Cover the key components of tokenizer design: normalization, pre-tokenization, post-tokenization, and decoding. Implement and train the tokenizer, test its functionality, and gain insight into working with low-resource languages in NLP. By the end of this tutorial, you'll have a solid understanding of tokenizer development for unique linguistic contexts and be able to apply these techniques to other low-resource languages.
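At the heart of WordPiece is a greedy longest-match-first lookup: each word is split into the longest subwords found in the vocabulary, with word-internal pieces marked by a "##" prefix. A minimal pure-Python sketch of that matching step (the toy vocabulary and Latin-script example are illustrative; a trained Dhivehi tokenizer would hold Thaana subwords):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the window from the right until a vocabulary entry matches.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # word-internal pieces get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no subword covers this span: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##izer", "##s", "word"}
print(wordpiece_tokenize("tokenizers", vocab))  # ['token', '##izer', '##s']
```

Because the match is greedy from the left, "tokenizers" decomposes into the longest known prefix first, then continuation pieces; a word with no matching subwords collapses to the unknown token.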

Syllabus

Intro
Dhivehi Project
Hurdles for Low Resource Domains
Dhivehi Dataset
Download Dhivehi Corpus
Tokenizer Components
Normalizer Component
Pre-tokenization Component
Post-tokenization Component
Decoder Component
Tokenizer Implementation
Tokenizer Training
Post-processing Implementation
Decoder Implementation
Saving for Transformers
Tokenizer Test and Usage
Download Dhivehi Models
First Steps
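The component pipeline the syllabus walks through (normalizer, pre-tokenizer, training, post-processing, decoder, saving for Transformers) can be sketched with the Hugging Face `tokenizers` library. This is an assumption about the toolkit used; the corpus, vocabulary size, and special tokens below are placeholders, not values from the video:

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Model: WordPiece with an explicit unknown token.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode NFKC only; Thaana vowel diacritics carry meaning,
# so no accent stripping or lowercasing.
tokenizer.normalizer = normalizers.NFKC()

# Pre-tokenizer: split on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Trainer: illustrative vocab size and BERT-style special tokens.
trainer = WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
corpus = ["a tiny stand-in corpus", "replace with the Dhivehi corpus"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Post-processing: wrap every sequence in [CLS] ... [SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Decoder: merge ## continuation pieces back into whole words.
tokenizer.decoder = decoders.WordPiece(prefix="##")

# Save in a JSON format loadable by transformers' PreTrainedTokenizerFast.
tokenizer.save("tokenizer.json")

enc = tokenizer.encode("a tiny corpus")
print(enc.tokens)  # sequence starts with [CLS] and ends with [SEP]
```

Each attribute maps onto one syllabus step, so the same skeleton transfers to any low-resource language: only the normalizer choices and the training corpus are language-specific.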


Taught by

James Briggs

Related Courses

Sequence Models
DeepLearning.AI via Coursera
Modern Natural Language Processing in Python
Udemy
Stanford Seminar - Transformers in Language: The Development of GPT Models Including GPT-3
Stanford University via YouTube
Long Form Question Answering in Haystack
James Briggs via YouTube
Spotify's Podcast Search Explained
James Briggs via YouTube