MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Offered By: USENIX via YouTube
Course Description
Overview
Explore the groundbreaking research on scaling large language model (LLM) training to over 10,000 GPUs in this conference talk from NSDI '24. Dive into the design, implementation, and engineering challenges of MegaScale, a production system developed by ByteDance and Peking University researchers. Learn about the full-stack approach that co-designs algorithmic and system components to address unprecedented challenges in training efficiency and stability. Discover techniques for model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline improvements, and network performance tuning. Gain insights into maintaining high efficiency throughout long-duration LLM training jobs and the importance of in-depth observability for tackling hard stability issues that emerge at large scale. Examine the diagnostic tools developed to monitor system components, identify root causes, and implement effective fault tolerance and straggler mitigation. Understand how MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, a 1.34x improvement in MFU over Megatron-LM. Benefit from the operational experience shared in identifying and fixing failures and stragglers, and gain valuable insights for future LLM systems research.
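To make the headline efficiency metric concrete, here is a minimal sketch of how Model FLOPs Utilization is commonly estimated, using the standard rule of thumb of roughly 6 FLOPs per parameter per training token (forward plus backward pass). The exact accounting used by the MegaScale authors may differ, and every numeric input in the example is an illustrative placeholder rather than a figure taken from the paper.

def estimate_mfu(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate Model FLOPs Utilization (MFU) as a fraction of cluster peak.

    Uses the common ~6 FLOPs per parameter per token approximation
    (about 2 for the forward pass and 4 for the backward pass).
    """
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second

# Illustrative placeholder values (not from the paper): a 175B-parameter
# model on 12,288 GPUs, an assumed 312 TFLOP/s peak per GPU, and an
# assumed aggregate throughput of 2.0M tokens per second.
mfu = estimate_mfu(num_params=175e9,
                   tokens_per_second=2.0e6,
                   num_gpus=12_288,
                   peak_flops_per_gpu=312e12)
print(f"Estimated MFU: {mfu:.1%}")

The value of MFU is that it normalizes training throughput by the cluster's theoretical peak, so efficiency can be compared across cluster sizes and model scales.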
Syllabus
NSDI '24 - MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Taught by
USENIX
Related Courses
Biomolecular Modeling on GPU (Моделирование биологических молекул на GPU) - Moscow Institute of Physics and Technology via Coursera
Practical Deep Learning for Coders - fast.ai via Independent
GPU Architectures and Programming - Indian Institute of Technology, Kharagpur via Swayam
Perform Real-Time Object Detection with YOLOv3 - Coursera Project Network via Coursera
Getting Started with PyTorch - Coursera Project Network via Coursera