
Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Offered By: Andrej Karpathy via YouTube

Tags

GPT-2 Courses, Machine Learning Courses, Deep Learning Courses, Neural Networks Courses, PyTorch Courses, Transformer Models Courses, Model Training Courses

Course Description

Overview

Embark on a comprehensive 4-hour journey to reproduce GPT-2 (124M) from scratch in this in-depth video tutorial. Explore the entire process, from building the GPT-2 network to optimizing its training for maximum efficiency. Follow along as the instructor sets up the training run according to GPT-2 and GPT-3 paper specifications, initiates the process, and analyzes the results. Gain insights into model architecture, parameter loading, forward pass implementation, sampling techniques, and data handling. Dive into advanced topics such as mixed precision training, GPU optimization, gradient accumulation, and distributed data parallel processing. Learn about hyperparameter tuning, learning rate scheduling, and evaluation methods. By the end, you'll have a thorough understanding of building and training a GPT-2 model, with practical knowledge applicable to larger language models.
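
For orientation, the "124M" in the title refers to the smallest GPT-2 configuration: 12 transformer layers, 12 attention heads, a 768-dimensional embedding, a 1024-token context window, and a 50,257-token BPE vocabulary. A minimal sketch of that configuration as a Python dataclass (the name GPTConfig is illustrative, not necessarily what the video uses):

    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024   # maximum sequence length (context window)
        vocab_size: int = 50257  # GPT-2 BPE vocabulary size
        n_layer: int = 12        # number of transformer blocks
        n_head: int = 12         # attention heads per block
        n_embd: int = 768        # embedding / hidden dimension

    # Embeddings plus 12 blocks of attention and MLP weights add up to
    # roughly 124M parameters, hence the model's name.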

Syllabus

intro: Let’s reproduce GPT-2 124M
exploring the GPT-2 124M OpenAI checkpoint
SECTION 1: implementing the GPT-2 nn.Module
loading the huggingface/GPT-2 parameters
implementing the forward pass to get logits
sampling init, prefix tokens, tokenization
sampling loop
sample, auto-detect the device
let’s train: data batches B,T → logits B,T,C
cross entropy loss
optimization loop: overfit a single batch
data loader lite
parameter sharing wte and lm_head
model initialization: std 0.02, residual init
SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
Tensor Cores, timing the code, TF32 precision, 333ms
float16, gradient scalers, bfloat16, 300ms
torch.compile, Python overhead, kernel fusion, 130ms
flash attention, 96ms
nice/ugly numbers. vocab size 50257 → 50304, 93ms
SECTION 3: hyperparameters, AdamW, gradient clipping
learning rate scheduler: warmup + cosine decay
batch size schedule, weight decay, FusedAdamW, 90ms
gradient accumulation
distributed data parallel DDP
datasets used in GPT-2, GPT-3, FineWeb EDU
validation data split, validation loss, sampling revive
evaluation: HellaSwag, starting the run
SECTION 4: results in the morning! GPT-2, GPT-3 repro
shoutout to llm.c, equivalent but faster code in raw C/CUDA
summary, phew, build-nanogpt github repo
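
To make several of the syllabus items above more concrete, the sketches below show the underlying techniques in plain PyTorch. They are simplified, illustrative examples written for this description, not the code from the video; class names, variable names, and specific numbers are assumptions unless stated otherwise. First, "implementing the GPT-2 nn.Module": the network is a stack of identical transformer blocks, each applying pre-norm LayerNorm and residual connections around causal self-attention and an MLP.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_embd, n_head = 768, 12  # GPT-2 124M settings

    class CausalSelfAttention(nn.Module):
        def __init__(self):
            super().__init__()
            self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused query/key/value projection
            self.c_proj = nn.Linear(n_embd, n_embd)      # projection back into the residual stream

        def forward(self, x):
            B, T, C = x.size()
            q, k, v = self.c_attn(x).split(n_embd, dim=2)
            # split heads: (B, T, C) -> (B, n_head, T, head_dim)
            q = q.view(B, T, n_head, C // n_head).transpose(1, 2)
            k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
            v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused causal attention
            y = y.transpose(1, 2).contiguous().view(B, T, C)
            return self.c_proj(y)

    class MLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.c_fc = nn.Linear(n_embd, 4 * n_embd)
            self.gelu = nn.GELU(approximate="tanh")      # GPT-2 uses the tanh-approximate GELU
            self.c_proj = nn.Linear(4 * n_embd, n_embd)

        def forward(self, x):
            return self.c_proj(self.gelu(self.c_fc(x)))

    class Block(nn.Module):
        # One transformer block: pre-norm, attention and MLP added to a clean residual stream.
        def __init__(self):
            super().__init__()
            self.ln_1 = nn.LayerNorm(n_embd)
            self.attn = CausalSelfAttention()
            self.ln_2 = nn.LayerNorm(n_embd)
            self.mlp = MLP()

        def forward(self, x):
            x = x + self.attn(self.ln_1(x))
            x = x + self.mlp(self.ln_2(x))
            return x

    x = torch.randn(2, 8, n_embd)
    print(Block()(x).shape)  # torch.Size([2, 8, 768])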
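
"sampling init, prefix tokens" and "sampling loop": generation starts from a tokenized prefix and repeatedly feeds the growing sequence through the model, keeps only the logits at the last position, and samples the next token, here with top-k filtering. A sketch, assuming model(idx) is a callable that returns logits of shape (B, T, vocab_size):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def generate(model, idx, max_new_tokens, top_k=50):
        # idx: (B, T) tensor of prefix token ids; appends max_new_tokens sampled tokens.
        for _ in range(max_new_tokens):
            logits = model(idx)                       # (B, T, vocab_size)
            logits = logits[:, -1, :]                 # keep only the last position
            probs = F.softmax(logits, dim=-1)
            topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
            sample = torch.multinomial(topk_probs, 1)           # (B, 1) index into the top-k set
            next_token = torch.gather(topk_idx, -1, sample)     # map back to vocabulary ids
            idx = torch.cat((idx, next_token), dim=1)
        return idx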
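
"let's train: data batches B,T → logits B,T,C" and "cross entropy loss": the model maps a (B, T) batch of token ids to (B, T, vocab_size) logits, and the loss flattens both tensors so every position becomes one next-token classification. The near-uniform logits below mimic a freshly initialized model:

    import torch
    import torch.nn.functional as F

    B, T, vocab_size = 4, 32, 50257
    logits = 0.01 * torch.randn(B, T, vocab_size)   # stand-in for model output of shape (B, T, C)
    targets = torch.randint(0, vocab_size, (B, T))  # next-token targets of shape (B, T)

    # Flatten batch and time so each position is one classification example.
    loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(-1))
    print(loss.item())  # ≈ ln(50257) ≈ 10.8, the loss when all tokens are equally likely:
                        # a useful sanity check at initialization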
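
"parameter sharing wte and lm_head" and "model initialization: std 0.02, residual init": GPT-2 ties the token-embedding matrix to the final output projection, initializes weights from a normal distribution with std 0.02, and scales down the init of the projections that feed the residual stream by 1/sqrt(2 * n_layer) so activations do not grow with depth. A sketch of both ideas (the IS_RESIDUAL_PROJ flag is an illustrative marker, not a PyTorch feature):

    import torch.nn as nn

    n_layer, n_embd, vocab_size = 12, 768, 50257

    wte = nn.Embedding(vocab_size, n_embd)
    lm_head = nn.Linear(n_embd, vocab_size, bias=False)
    lm_head.weight = wte.weight  # weight tying: one shared (vocab_size, n_embd) matrix

    def init_weights(module):
        if isinstance(module, nn.Linear):
            std = 0.02
            # Projections on the residual path get a smaller init: std / sqrt(2 * n_layer),
            # since every layer adds two residual contributions (attention and MLP).
            if getattr(module, "IS_RESIDUAL_PROJ", False):
                std *= (2 * n_layer) ** -0.5
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    # Typically applied with model.apply(init_weights) after construction.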
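
SECTION 2's speed items are mostly one-liners on a modern NVIDIA GPU: TF32 matmuls, bfloat16 autocast, torch.compile, and the fused scaled-dot-product (flash) attention kernel. A sketch that runs on CPU or GPU; the compile line is commented out because it only pays off on a real model:

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # TF32: lets float32 matmuls use TensorFloat-32 on Ampere-class GPUs (no-op elsewhere).
    torch.set_float32_matmul_precision("high")

    # torch.compile fuses kernels and removes Python overhead:
    # model = torch.compile(model)

    # bfloat16 autocast around the forward pass and loss computation.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        B, n_head, T, head_dim = 4, 12, 64, 64
        q = torch.randn(B, n_head, T, head_dim, device=device)
        k = torch.randn(B, n_head, T, head_dim, device=device)
        v = torch.randn(B, n_head, T, head_dim, device=device)
        # Flash attention: fused causal attention that never materializes the (T, T) matrix.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(y.shape)  # torch.Size([4, 12, 64, 64])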
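
"learning rate scheduler: warmup + cosine decay": following the GPT-3 paper, the learning rate ramps up linearly over a warmup period and then decays along a cosine curve to a minimum value. A self-contained sketch; the specific constants are illustrative choices, not prescriptions:

    import math

    max_lr = 6e-4
    min_lr = max_lr * 0.1
    warmup_steps = 715
    max_steps = 19073

    def get_lr(step):
        # 1) Linear warmup from 0 to max_lr over warmup_steps.
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        # 2) After max_steps, hold at min_lr.
        if step > max_steps:
            return min_lr
        # 3) In between, cosine decay from max_lr down to min_lr.
        decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
        coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
        return min_lr + coeff * (max_lr - min_lr)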
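
"gradient accumulation" and "gradient clipping", together with SECTION 3's AdamW and weight-decay settings: when the target batch size does not fit in memory, the loss from several micro-batches is accumulated before a single optimizer step, and the global gradient norm is clipped to 1.0. A self-contained sketch with a tiny stand-in model so it actually runs:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(16, 4)  # tiny stand-in for the GPT model
    # AdamW with weight decay; on CUDA, passing fused=True selects the faster fused kernel.
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

    grad_accum_steps = 8  # e.g. desired_batch_size // (B * T) in a real run

    optimizer.zero_grad()
    for micro_step in range(grad_accum_steps):
        x = torch.randn(32, 16)
        y = torch.randint(0, 4, (32,))
        loss = F.cross_entropy(model(x), y)
        # Scale so the accumulated gradient equals the mean over the full batch,
        # not the sum of per-micro-batch means.
        (loss / grad_accum_steps).backward()

    # Clip the global gradient norm to 1.0 before the update (the GPT-3 setting).
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()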
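
"distributed data parallel DDP": each process launched by torchrun drives one GPU; wrapping the model in DistributedDataParallel makes backward() all-reduce (average) gradients across processes so every rank takes the same optimizer step. A minimal setup sketch (requires multiple GPUs and a launch such as torchrun --standalone --nproc_per_node=8 train.py; the Linear layer is a stand-in for the GPT model):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # torchrun sets these environment variables for every process it launches.
    ddp_rank = int(os.environ["RANK"])
    ddp_local_rank = int(os.environ["LOCAL_RANK"])
    ddp_world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(backend="nccl")
    device = f"cuda:{ddp_local_rank}"
    torch.cuda.set_device(device)

    model = torch.nn.Linear(768, 768).to(device)     # stand-in for the GPT model
    model = DDP(model, device_ids=[ddp_local_rank])

    # Each rank reads a different slice of the data (e.g. start at batch ddp_rank and
    # advance by ddp_world_size) and runs the usual training loop; DDP synchronizes
    # gradients during loss.backward().

    dist.destroy_process_group()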


Taught by

Andrej Karpathy
