Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization
Offered By: Andrej Karpathy via YouTube
Course Description
Overview
Syllabus
intro: Let’s reproduce GPT-2 124M
exploring the GPT-2 124M OpenAI checkpoint
SECTION 1: implementing the GPT-2 nn.Module
loading the huggingface/GPT-2 parameters
implementing the forward pass to get logits
sampling init, prefix tokens, tokenization
sampling loop
sample, auto-detect the device
let’s train: data batches B,T → logits B,T,C
cross entropy loss
optimization loop: overfit a single batch
data loader lite
parameter sharing wte and lm_head
model initialization: std 0.02, residual init
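The last two chapters of SECTION 1 name two small but important details: the token embedding (wte) and the output head (lm_head) share the same weight tensor, and linear layers are initialized with std 0.02, with residual-stream projections scaled down by 1/sqrt(2·n_layer). Below is a minimal PyTorch sketch of both ideas, not the exact code from the video; GPTConfig, the SCALE_INIT flag, and the omitted transformer blocks are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            wpe=nn.Embedding(config.block_size, config.n_embd),
            # transformer blocks and the final layer norm are omitted in this sketch
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # parameter sharing: wte and lm_head point at the same weight tensor
        self.transformer.wte.weight = self.lm_head.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            # projections feeding the residual stream get a scaled-down init
            if getattr(module, 'SCALE_INIT', False):
                std *= (2 * self.config.n_layer) ** -0.5
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
```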
SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
Tensor Cores, timing the code, TF32 precision, 333ms
float16, gradient scalers, bfloat16, 300ms
torch.compile, Python overhead, kernel fusion, 130ms
flash attention, 96ms
nice/ugly numbers. vocab size 50257 → 50304, 93ms
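SECTION 2 is a sequence of throughput optimizations: TF32 matmuls on tensor cores, bfloat16 autocast (which, unlike float16, needs no gradient scaler), torch.compile for kernel fusion, flash attention, and padding the vocabulary from 50257 up to the "nicer" 50304. The sketch below shows how these might fit into a training step, assuming the GPT module, a data loader, and an optimizer from SECTION 1 already exist.

```python
import torch
import torch.nn.functional as F

torch.set_float32_matmul_precision('high')   # allow TF32 matmuls on tensor cores

model = GPT(GPTConfig(vocab_size=50304))     # pad 50257 up to a multiple of 128
model.to('cuda')
model = torch.compile(model)                 # remove Python overhead, fuse kernels

for x, y in train_loader:                    # train_loader assumed from SECTION 1
    x, y = x.to('cuda'), y.to('cuda')
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 mixed precision: same exponent range as float32, so no GradScaler needed
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss.backward()
    optimizer.step()

# inside the attention block, the manual softmax(q @ k.T) @ v is replaced by
# flash attention via: y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```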
SECTION 3: hyperparameters, AdamW, gradient clipping
learning rate scheduler: warmup + cosine decay
batch size schedule, weight decay, FusedAdamW, 90ms
gradient accumulation
distributed data parallel DDP
datasets used in GPT-2, GPT-3, FineWeb EDU
validation data split, validation loss, sampling revive
evaluation: HellaSwag, starting the run
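SECTION 3 mirrors the GPT-3 training recipe: AdamW with gradient clipping at 1.0, a learning rate with linear warmup followed by cosine decay to 10% of the peak, weight decay, gradient accumulation to reach the target batch size, and DDP across GPUs. Here is a small sketch of the warmup-plus-cosine scheduler; the specific constants are illustrative, not necessarily the ones used in the run.

```python
import math

max_lr = 6e-4            # peak learning rate (illustrative, GPT-3-small scale)
min_lr = max_lr * 0.1    # decay down to 10% of the peak
warmup_steps = 715       # illustrative; choose to match the token budget
max_steps = 19073

def get_lr(step):
    # 1) linear warmup from ~0 up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the decay horizon, hold at min_lr
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# each step: clip gradients, then set the scheduled learning rate before optimizer.step()
# norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# for group in optimizer.param_groups:
#     group['lr'] = get_lr(step)
```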
SECTION 4: results in the morning! GPT-2, GPT-3 repro
shoutout to llm.c, equivalent but faster code in raw C/CUDA
summary, phew, build-nanogpt github repo
Taught by
Andrej Karpathy