MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Offered By: USENIX via YouTube
Course Description
Overview
Explore research on scaling large language model (LLM) training to more than 10,000 GPUs in this NSDI '24 conference talk. Dive into the design, implementation, and engineering challenges of MegaScale, a production system developed by researchers at ByteDance and Peking University. Learn about the full-stack approach that co-designs algorithmic and system components to address the unprecedented challenges this scale poses to training efficiency and stability. Discover techniques for model block and optimizer design, computation-communication overlapping, operator optimization, data pipeline improvements, and network performance tuning.

Gain insight into maintaining high efficiency throughout long-duration LLM training jobs and into the in-depth observability needed to tackle the hard stability issues that emerge at large scale. Examine the set of diagnostic tools developed to monitor system components, identify root causes, and implement effective fault tolerance and straggler mitigation. Understand how MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, a 1.34x improvement over Megatron-LM. Benefit from the operational experience shared in identifying and fixing failures and stragglers, and take away lessons for future LLM systems research.
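For context on the headline result, below is a minimal sketch of how Model FLOPs Utilization is commonly defined: achieved model FLOPs per second divided by the aggregate peak hardware FLOPs, using the standard approximation of roughly 6 FLOPs per parameter per token for dense transformer training. The function name and all numeric inputs are illustrative assumptions, not figures from the talk (aside from the 175B model size and 12,288-GPU count mentioned above), so the printed value is not the paper's 55.2%.

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPs per second / aggregate peak hardware FLOPs.

    Uses the common ~6 FLOPs per parameter per token approximation
    (forward + backward pass) for dense transformer training.
    """
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # Illustrative inputs only: the token rate and per-GPU peak throughput
    # below are assumptions, not numbers reported in the talk.
    mfu = model_flops_utilization(
        params=175e9,                # 175B-parameter model
        tokens_per_second=1.8e6,     # assumed aggregate training token rate
        num_gpus=12288,
        peak_flops_per_gpu=312e12,   # e.g., approximate A100 BF16 peak
    )
    print(f"MFU ~ {mfu:.1%}")
```

Plugging in the actual cluster's token throughput and hardware peak would recover the reported utilization; the point of the sketch is only to show which quantities the metric depends on.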
Syllabus
NSDI '24 - MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Taught by
USENIX
Related Courses
Cloud Computing Concepts, Part 1 (University of Illinois at Urbana-Champaign via Coursera)
Cloud Computing Concepts: Part 2 (University of Illinois at Urbana-Champaign via Coursera)
Reliable Distributed Algorithms - Part 1 (KTH Royal Institute of Technology via edX)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)
Réalisez des calculs distribués sur des données massives (CentraleSupélec via OpenClassrooms)