Pollux - Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Offered By: USENIX via YouTube
Course Description
Overview
Explore an approach to deep learning cluster scheduling in this 14-minute conference talk from OSDI '21. The talk presents Pollux, a co-adaptive cluster scheduler that optimizes goodput for deep learning training jobs. Learn how the system considers per-job and cluster-wide factors simultaneously to improve resource allocation and utilization. Discover the goodput metric, which combines system throughput with statistical efficiency, and understand how Pollux dynamically reassigns resources to improve overall cluster performance. Gain insights into the system's ability to reduce average job completion times, promote fairness, and potentially lower costs in cloud environments. Examine the background of distributed deep learning, the impact of batch size on system throughput and statistical efficiency, and the key components of Pollux's cluster scheduler. The talk closes with evaluation results and the broader implications of this approach to deep learning cluster management.
Syllabus
Intro
Deep Learning Training in Shared Clusters
Example Shared-Cluster DL Training Workflow
Pollux: Co-adaptive Cluster Scheduler for DL
Outline
Background: Distributed DL (Data Parallelism)
System Throughput and Impact of Batch Size
Statistical Efficiency and Impact of Batch Size
Illustration of Overall Training Performance
Implications for Cluster Scheduling
Pollux Cluster Scheduler
Key Idea: Goodput, not Throughput
Modeling System Throughput
Modeling Statistical Efficiency
Optimizing Cluster-Wide Allocations
Evaluation of Pollux
Cluster-Wide Statistical Efficiency
More Experiments in our Paper!
Conclusion
Taught by
USENIX
Related Courses
GraphX - Graph Processing in a Distributed Dataflow Framework
USENIX via YouTube
Theseus - An Experiment in Operating System Structure and State Management
USENIX via YouTube
RedLeaf - Isolation and Communication in a Safe Operating System
USENIX via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
KungFu - Making Training in Distributed Machine Learning Adaptive
USENIX via YouTube