Metis - Fast Automatic Distributed Training on Heterogeneous GPUs
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk from USENIX ATC '24 that introduces Metis, a system for automatic distributed training on heterogeneous GPUs. Delve into the challenges posed by growing deep learning model sizes and the need to use diverse GPU types efficiently. Learn how Metis optimizes key system components to exploit the compute power and memory capacity of each GPU type, enabling fine-grained distribution of training workloads. Discover the novel search algorithm developed to efficiently prune large search spaces and balance loads with heterogeneity-awareness. Examine the evaluation results showcasing Metis's performance in finding optimal parallelism plans for large models such as GPT-3, MoE, and Wide-ResNet across multiple GPU types. Gain insights into how Metis achieves significant training speed-ups while reducing profiling and search overheads compared to traditional methods and oracle planning.
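To make the idea of heterogeneity-aware load balancing concrete, here is a minimal sketch of one common heuristic: splitting a global batch across GPUs in proportion to their profiled throughput, capped by memory capacity. The function name, the GPU figures, and the heuristic itself are illustrative assumptions, not Metis's actual algorithm.

```python
# Illustrative sketch only: proportional batch splitting across
# heterogeneous GPUs. All names and numbers are hypothetical.

def split_batch(global_batch, gpus):
    """Assign per-GPU micro-batch sizes proportional to profiled
    throughput (samples/sec), capped by each GPU's memory limit
    (the max number of samples it can hold)."""
    total_tput = sum(g["throughput"] for g in gpus)
    # Initial proportional shares, rounded down.
    shares = [int(global_batch * g["throughput"] / total_tput) for g in gpus]
    # Enforce per-GPU memory caps.
    shares = [min(s, g["mem_cap"]) for s, g in zip(shares, gpus)]
    # Greedily hand leftover samples to GPUs with spare memory, fastest first.
    leftover = global_batch - sum(shares)
    order = sorted(range(len(gpus)), key=lambda i: -gpus[i]["throughput"])
    for i in order:
        if leftover <= 0:
            break
        room = gpus[i]["mem_cap"] - shares[i]
        take = min(room, leftover)
        shares[i] += take
        leftover -= take
    return shares

# Hypothetical cluster: one fast GPU, one mid-tier, one slow.
gpus = [
    {"name": "A100", "throughput": 300.0, "mem_cap": 64},
    {"name": "V100", "throughput": 150.0, "mem_cap": 32},
    {"name": "T4",   "throughput": 50.0,  "mem_cap": 16},
]
print(split_batch(100, gpus))  # faster GPUs receive larger micro-batches
```

A real planner like the one the talk describes must additionally co-optimize pipeline stages, tensor-parallel degrees, and communication costs, which is what makes the search space large and pruning necessary.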
Syllabus
USENIX ATC '24 - Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
Taught by
USENIX
Related Courses
Future of Computing - IBM Power 9 and Beyond (openHPI)
SIGCOMM 2020 - Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics (Association for Computing Machinery (ACM) via YouTube)
Offload Annotations - Bringing Heterogeneous Computing to Existing Libraries and Workloads (USENIX via YouTube)
Supercomputing Spotlights - Supercomputing Software for Moore and Beyond (Society for Industrial and Applied Mathematics via YouTube)
Liquid Metal - Taming Heterogeneity (GOTO Conferences via YouTube)