YoVDO

Building the GPT Tokenizer - From Strings to Tokens and Back

Offered By: Andrej Karpathy via YouTube

Tags

Unicode Courses

Course Description

Overview

Dive into a comprehensive 2-hour lecture on building the GPT Tokenizer from scratch. Explore the crucial role of tokenization in Large Language Models (LLMs), understanding its separate training process and fundamental functions. Learn about Byte Pair Encoding, Unicode, and various encoding methods. Implement key components like encoding, decoding, and regex patterns. Compare different tokenizer libraries and examine tokenization quirks in LLMs. Gain hands-on experience through exercises, including creating your own GPT-4 tokenizer. Discover insights on multimodal tokenization and potential future improvements in the field.

Syllabus

intro: Tokenization, GPT-2 paper, tokenization-related issues
tokenization by example in a Web UI tiktokenizer
strings in Python, Unicode code points
Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
daydreaming: deleting tokenization
Byte Pair Encoding BPE algorithm walkthrough
starting the implementation
counting consecutive pairs, finding most common pair
merging the most common pair
training the tokenizer: adding the while loop, compression ratio
tokenizer/LLM diagram: it is a completely separate stage
decoding tokens to strings
encoding strings to tokens
regex patterns to force splits across categories
tiktoken library intro, differences between GPT-2/GPT-4 regex
GPT-2 encoder.py released by OpenAI walkthrough
special tokens, tiktoken handling of, GPT-2/GPT-4 differences
minbpe exercise time! write your own GPT-4 tokenizer
sentencepiece library intro, used to train Llama 2 vocabulary
how to set vocabulary size? revisiting gpt.py transformer
training new tokens, example of prompt compression
multimodal [image, video, audio] tokenization with vector quantization
revisiting and explaining the quirks of LLM tokenization
final recommendations
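
The core syllabus steps (counting consecutive pairs, merging the most common pair, the training loop, decoding, and encoding) can be illustrated with a minimal byte-level BPE sketch. This is not the lecture's code or the minbpe implementation: the function names `get_pair_counts`, `merge`, `train`, `encode`, and `decode` are assumptions chosen for illustration, and the sketch omits the regex splitting and special tokens that the GPT-2/GPT-4 tokenizers use.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each consecutive pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in training order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most common pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply learned merges to new text, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        # Pick the pair that was learned earliest; unknown pairs rank last.
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order matches training order
        vocab[new_id] = vocab[a] + vocab[b]
    # errors="replace" guards against id sequences that are not valid UTF-8.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, `merges = train("aaabdaaabac", 3)` followed by `decode(encode("aaabdaaabac", merges), merges)` should return the original string, with the encoded sequence shorter than the 11 input bytes. The ratio of input bytes to output tokens is one way to express the compression ratio mentioned in the syllabus.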

Taught by

Andrej Karpathy

Related Courses

Introduction to Internationalization and Localization
University of Washington via edX
Introduction Pratique à YAML
Coursera Project Network via Coursera
encoding and decoding in python
Udemy
Field Guide to Binary
Pluralsight
Design the Web: Creating and Protecting Email Links
LinkedIn Learning