YoVDO

MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Offered By: USENIX via YouTube

Tags

Distributed Computing Courses, Machine Learning Courses, Cloud Computing Courses, Hyperscale Computing Courses

Course Description

Overview

Explore a conference talk on MAST, a global scheduler for ML training workloads across geo-distributed datacenters at hyperscale. Learn about the challenges of manual datacenter region selection in public clouds and how MAST addresses these issues in Meta's private cloud. Discover the three key design principles enabling MAST to schedule complex ML training workloads globally: temporal decoupling, scope decoupling, and exhaustive search. Understand how MAST successfully balances load across global regions, reducing the GPU demand-to-supply ratio for high-priority workloads from 2.63 to 0.98 in the most overloaded region. Gain insights into the global-scheduling abstraction provided by MAST and its impact on hardware utilization and profitability.

Syllabus

OSDI '24 - MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale


Taught by

USENIX

Related Courses

Inference and Quantization for AI - Session 3
Nvidia via YouTube
Hyperscale vDPA: Scaling Virtual Data Path Acceleration
Linux Foundation via YouTube
Multiple Workloads and Protocols - One Software-Defined Solution for Flash Storage
Linux Foundation via YouTube
What If Flash Was Software Defined - Revolutionizing Data Storage
Linux Foundation via YouTube
Unlocking the Power of Flash with Open Source Software-Enabled Flash Technology
Linux Foundation via YouTube