Themis - Fair and Efficient GPU Cluster Scheduling
Offered By: USENIX via YouTube
Course Description
Overview
Explore a 22-minute conference talk from USENIX NSDI '20 that introduces Themis, a scheduling framework for GPU clusters running distributed machine learning workloads. The talk covers the challenges of allocating GPUs fairly and efficiently across multiple ML jobs and shows how Themis addresses them with a two-level scheduling architecture. It introduces the finish-time fairness metric and explains how an auction-based resource allocation mechanism is used to implement it. The evaluation compares Themis with existing schedulers and reports improved fairness and cluster efficiency. Topics include GPU cluster scheduling, resource allocation strategies, and the specific needs of ML training workloads in shared environments.
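As a rough illustration of the finish-time fairness idea mentioned above, the sketch below computes the ratio of a job's finish time in the shared cluster to its finish time on an exclusive 1/N share of the cluster, then ranks jobs by that ratio. The Job class, the field names, and the sample numbers are all hypothetical and chosen only for illustration. This is not the Themis implementation, and it leaves out the auction mechanism entirely.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    t_shared: float       # estimated finish time in the shared cluster (hours)
    t_independent: float  # estimated finish time on an exclusive 1/N share (hours)


def finish_time_fairness(job: Job) -> float:
    """Return rho = t_shared / t_independent for one job.

    rho > 1 means the job finishes later than it would on its own
    exclusive share; rho <= 1 means sharing did not hurt it.
    """
    if job.t_independent <= 0:
        raise ValueError("t_independent must be positive")
    return job.t_shared / job.t_independent


def rank_by_unfairness(jobs):
    """Sort jobs so the worst-treated job (largest rho) comes first."""
    return sorted(jobs, key=finish_time_fairness, reverse=True)


# Hypothetical jobs and estimates, for illustration only.
jobs = [
    Job("vision", t_shared=12.0, t_independent=10.0),
    Job("nlp", t_shared=8.0, t_independent=10.0),
    Job("speech", t_shared=30.0, t_independent=15.0),
]
ranked = rank_by_unfairness(jobs)
```

In this sketch a scheduler would offer newly freed GPUs to the jobs at the front of the ranked list first, since those are furthest from what they would get on an exclusive share.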
Syllabus
Intro
Deep Learning at a Large Enterprise
GPU Cluster Scheduler: Goal
Existing GPU Cluster Schedulers
GPU Cluster Scheduler: Drawback 2
GPU Cluster Scheduler: Requirement 2
Towards a new GPU Cluster Scheduler
Themis: Metric
Themis: Finish-Time Fairness Metric
Themis: Interface
Strawman Mechanism: Issues
Themis: Mechanism: Partial Allocation Auction
Themis: Overall Design
Themis: Implementation
Themis: Evaluation
Macrobenchmark: Sharing Incentive
Macrobenchmark: Efficiency
Conclusion
Taught by
USENIX
Related Courses
Scaling Memcache at Facebook (USENIX via YouTube)
Multi-Person Localization via RF Body Reflections (USENIX via YouTube)
Opaque - An Oblivious and Encrypted Distributed Analytics Platform (USENIX via YouTube)
Live Video Analytics at Scale with Approximation and Delay-Tolerance (USENIX via YouTube)
Clipper - A Low-Latency Online Prediction Serving System (USENIX via YouTube)