MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses

Course Description

Overview

Explore a comprehensive analysis of Machine Learning as a Service (MLaaS) workloads in large-scale heterogeneous GPU clusters through this 15-minute conference talk from NSDI '22. Dive into the challenges of running diverse ML workloads, including low GPU utilization, long queueing delays, and scheduling complexities. Examine a two-month workload trace from Alibaba's production MLaaS cluster with over 6,000 GPUs, and learn about current solutions and open challenges in cluster scheduling. Gain insights into resource requests, machine utilization, GPU sharing, task duration prediction, and potential CPU bottlenecks. Understand the implications of imbalanced scheduling across heterogeneous machines and discover key takeaways for optimizing large-scale ML infrastructure.
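One idea the talk covers is predicting the duration of recurring tasks from their execution history. As a rough illustration only (the task signatures, data, and function below are hypothetical, not from the talk), a minimal sketch might key past runtimes by a task signature and predict the median of prior runs:

```python
from statistics import median

# Hypothetical history of past run times (minutes), keyed by a task
# "signature" such as user + entry script. Illustrative data only.
history = {
    "alice:train_resnet.py": [42, 45, 40, 44],
    "bob:etl_job.py": [12, 15, 11],
}

def predict_duration(signature, default=30):
    """Predict a recurring task's runtime as the median of its past runs,
    falling back to a default for tasks with no recorded history."""
    runs = history.get(signature)
    return median(runs) if runs else default

print(predict_duration("alice:train_resnet.py"))  # -> 43.0
print(predict_duration("carol:new_job.py"))       # -> 30 (no history)
```

The actual paper's predictor is more involved; this only conveys the basic intuition that recurring workloads make duration estimation tractable.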

Syllabus

Intro
Production ML Workloads
Trace Overview
Run-time and Queueing Delays
Resource Requests & Usage
Machine Resource Utilization
GPU Sharing
Duration Prediction for Recurring Tasks
CPU Can Be the Bottleneck
Imbalanced Scheduling
Takeaways


Taught by

USENIX

Related Courses

Scaling Memcache at Facebook
USENIX via YouTube
Multi-Person Localization via RF Body Reflections
USENIX via YouTube
Opaque - An Oblivious and Encrypted Distributed Analytics Platform
USENIX via YouTube
Live Video Analytics at Scale with Approximation and Delay-Tolerance
USENIX via YouTube
Clipper - A Low-Latency Online Prediction Serving System
USENIX via YouTube