MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
Offered By: USENIX via YouTube
Course Description
Overview
Explore a conference talk on MAST, a global scheduler for ML training workloads across geo-distributed datacenters at hyperscale. Learn about the challenges of manual datacenter region selection in public clouds and how MAST addresses these issues in Meta's private cloud. Discover the three key design principles that enable MAST to schedule complex ML training workloads globally: temporal decoupling, scope decoupling, and exhaustive search. Understand how MAST balances load across global regions, reducing the GPU demand-to-supply ratio for high-priority workloads in the most overloaded region from 2.63 to 0.98, that is, from demand at more than 2.6 times the available supply to demand just under supply. Gain insights into the global-scheduling abstraction that MAST provides and its impact on hardware utilization and profitability.
Syllabus
OSDI '24 - MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
Taught by
USENIX
Related Courses
Software as a Service (University of California, Berkeley via Coursera)
Software Defined Networking (Georgia Institute of Technology via Coursera)
Pattern-Oriented Software Architectures: Programming Mobile Services for Android Handheld Systems (Vanderbilt University via Coursera)
Web-Technologien (openHPI)
Données et services numériques, dans le nuage et ailleurs (Certificat informatique et internet via France Université Numerique)