MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs

Offered By: USENIX via YouTube

Tags

Distributed Computing Courses
Fault Tolerance Courses
GPU Computing Courses
Scalability Courses
Model Training Courses

Course Description

Overview

Explore research on scaling large language model (LLM) training to more than 10,000 GPUs in this conference talk from NSDI '24. The talk covers the design, implementation, and engineering experience behind MegaScale, a production system developed by researchers at ByteDance and Peking University. Learn how its full-stack approach co-designs algorithmic and system components to address the training efficiency and stability challenges that arise at this scale, with techniques spanning model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline improvements, and network performance tuning.

Gain insight into keeping efficiency high throughout long-running LLM training jobs, and into why in-depth observability is needed to tackle stability problems that only emerge at large scale. Examine the diagnosis tools built to monitor system components, identify root causes, and support fault tolerance and straggler mitigation. See how MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, a 1.34x improvement over Megatron-LM. The talk also shares operational experience in identifying and fixing failures and stragglers, which may inform future LLM systems research.
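To make the MFU figure concrete, here is a minimal Python sketch of how Model FLOPs Utilization is commonly estimated. It is an illustration under stated assumptions, not the method used in the talk: it assumes the widely used approximation of about 6 FLOPs per parameter per token (forward plus backward pass, attention FLOPs ignored), and the per-GPU peak of 312 TFLOP/s is a placeholder chosen for this example, since the description does not name the GPU model. The function and constant names are hypothetical.

```python
def model_flops_utilization(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOP/s divided by aggregate hardware peak FLOP/s."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # ignoring attention FLOPs. This is an approximation, not an exact count.
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


def tokens_per_second_at_mfu(mfu, n_params, n_gpus, peak_flops_per_gpu):
    """Invert the MFU formula: throughput implied by a given MFU."""
    return mfu * n_gpus * peak_flops_per_gpu / (6.0 * n_params)


# Figures taken from the course description.
N_PARAMS = 175e9      # 175B-parameter model
N_GPUS = 12_288
REPORTED_MFU = 0.552
SPEEDUP = 1.34        # stated improvement over Megatron-LM

# Assumption (not from the description): peak dense throughput per GPU.
ASSUMED_PEAK_FLOPS = 312e12

# Throughput implied by the reported MFU under the assumptions above.
implied_tps = tokens_per_second_at_mfu(
    REPORTED_MFU, N_PARAMS, N_GPUS, ASSUMED_PEAK_FLOPS
)

# If the 1.34x figure is read as a ratio of MFUs, the implied baseline MFU is:
implied_baseline_mfu = REPORTED_MFU / SPEEDUP

if __name__ == "__main__":
    print(f"Implied throughput: {implied_tps:,.0f} tokens/s")
    print(f"Implied baseline MFU: {implied_baseline_mfu:.1%}")
```

Because MFU is a ratio against hardware peak, the implied throughput scales directly with the assumed peak value, so treat that number as illustrative only. The baseline MFU calculation depends only on the two figures quoted in the description.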

Syllabus

NSDI '24 - MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs


Taught by

USENIX

Related Courses

Моделирование биологических молекул на GPU (Biomolecular modeling on GPU)
Moscow Institute of Physics and Technology via Coursera
Practical Deep Learning For Coders
fast.ai via Independent
GPU Architectures And Programming
Indian Institute of Technology, Kharagpur via Swayam
Perform Real-Time Object Detection with YOLOv3
Coursera Project Network via Coursera
Getting Started with PyTorch
Coursera Project Network via Coursera