MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses

Course Description

Overview

Explore a comprehensive analysis of Machine Learning as a Service (MLaaS) workloads in large-scale heterogeneous GPU clusters through this 15-minute conference talk from NSDI '22. Dive into the challenges of running diverse ML workloads, including low GPU utilization, long queueing delays, and scheduling complexities. Examine a two-month workload trace from Alibaba's production MLaaS cluster with over 6,000 GPUs, and learn about current solutions and open challenges in cluster scheduling. Gain insights into resource requests, machine utilization, GPU sharing, task duration prediction, and potential CPU bottlenecks. Understand the implications of imbalanced scheduling across heterogeneous machines and discover key takeaways for optimizing large-scale ML infrastructure.
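One idea the talk covers is predicting the duration of recurring tasks from their execution history. As a rough illustration only (the task signatures, data, and function below are hypothetical, not from the talk), a minimal sketch might key past runtimes by a task signature and predict the median of prior runs:

```python
from statistics import median

# Hypothetical history of past run times (minutes), keyed by a task
# "signature" such as user + entry script. Illustrative data only.
history = {
    "alice:train_resnet.py": [42, 45, 40, 44],
    "bob:etl_job.py": [12, 15, 11],
}

def predict_duration(signature, default=30):
    """Predict a recurring task's runtime as the median of its past runs,
    falling back to a default for tasks with no recorded history."""
    runs = history.get(signature)
    return median(runs) if runs else default

print(predict_duration("alice:train_resnet.py"))  # -> 43.0
print(predict_duration("carol:new_job.py"))       # -> 30 (no history)
```

The actual paper's predictor is more involved; this only conveys the basic intuition that recurring workloads make duration estimation tractable.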

Syllabus

Intro
Production ML Workloads
Trace Overview
Run-time and Queueing Delays
Resource Requests & Usage
Machine Resource Utilization
GPU Sharing
Duration Prediction for Recurring Tasks
CPU Can Be the Bottleneck
Imbalanced Scheduling
Takeaways


Taught by

USENIX

Related Courses

Scaling Memcache at Facebook
USENIX via YouTube
Multi-Person Localization via RF Body Reflections
USENIX via YouTube
Opaque - An Oblivious and Encrypted Distributed Analytics Platform
USENIX via YouTube
Live Video Analytics at Scale with Approximation and Delay-Tolerance
USENIX via YouTube
Clipper - A Low-Latency Online Prediction Serving System
USENIX via YouTube