CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

Offered By: USENIX via YouTube

Tags

Distributed Computing Courses
Machine Learning Courses

Course Description

Overview

Watch a conference talk on CASSINI, a network-aware job scheduler for machine learning clusters. The talk introduces a geometric abstraction that accounts for the communication patterns of different jobs when placing them on network links. It then presents the Affinity graph, a technique that finds time-shift values to interleave the communication phases of jobs sharing the same network link. Experiments with 13 common ML models on a 24-server testbed show that, compared to state-of-the-art ML schedulers, CASSINI improves the average and tail completion times of jobs by up to 1.6x and 2.5x, respectively, and reduces the number of ECN-marked packets in the cluster by up to 33x.
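The time-shift idea described above can be illustrated with a toy example. The sketch below is not CASSINI's algorithm (the geometric abstraction and Affinity graph are covered in the talk itself); it is a brute-force search, under simplifying assumptions, for the delay that minimizes communication overlap between two jobs sharing one link. Each job is modeled as a hypothetical tuple `(period, comm_start, comm_len)` in integer time slots, and the function names are illustrative only.

```python
import math

def comm_active(t, period, comm_start, comm_len, shift=0):
    """Return True if a periodic job is communicating at integer time slot t.

    The job repeats every `period` slots and communicates for `comm_len`
    slots starting at `comm_start`; `shift` delays the whole pattern.
    """
    return (t - shift - comm_start) % period < comm_len

def overlap(shift, job_a, job_b, horizon):
    """Count slots in [0, horizon) where both jobs use the link at once."""
    return sum(
        1
        for t in range(horizon)
        if comm_active(t, *job_a) and comm_active(t, *job_b, shift=shift)
    )

def best_shift(job_a, job_b):
    """Brute-force the delay for job_b that minimizes overlap with job_a.

    Assumption: patterns are strictly periodic, so checking one common
    period (the least common multiple) is enough.
    """
    horizon = job_a[0] * job_b[0] // math.gcd(job_a[0], job_b[0])
    return min(range(job_b[0]), key=lambda s: overlap(s, job_a, job_b, horizon))

# Two identical jobs: compute for 6 slots, then communicate for 4 slots.
job_a = (10, 6, 4)
job_b = (10, 6, 4)
shift = best_shift(job_a, job_b)
print(shift, overlap(shift, job_a, job_b, 10))
```

In this toy case, starting both jobs together makes their communication phases collide in every iteration, while delaying the second job moves its communication into the first job's compute phase. A real scheduler has to handle many jobs, many links, and differing iteration times, which is the problem the talk addresses.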

Syllabus

NSDI '24 - CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters


Taught by

USENIX

Related Courses

Cloud Computing Concepts, Part 1
University of Illinois at Urbana-Champaign via Coursera
Cloud Computing Concepts: Part 2
University of Illinois at Urbana-Champaign via Coursera
Reliable Distributed Algorithms - Part 1
KTH Royal Institute of Technology via edX
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera
Réalisez des calculs distribués sur des données massives
CentraleSupélec via OpenClassrooms