A Case for Task Sampling Based Learning for Cluster Job Scheduling

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses

Course Description

Overview

Explore a novel approach to cluster job scheduling in this 16-minute conference talk from NSDI '22. Dive into the challenges of accurately estimating job runtime properties and learn about SLearn, a task sampling-based learning method that outperforms traditional history-based predictors. Discover how SLearn exploits similarities among tasks within the same job to achieve more accurate predictions, even in rapidly changing cluster environments. Examine analytical and experimental studies of production traces, and understand how SLearn reduces average Job Completion Time (JCT) compared to prior-art methods. Gain insights into SLearn's design and implementation, its potential application to DAG jobs, and its performance in simulation and in testbed experiments on Azure using real-world cluster job traces.
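
The contrast between the two estimation styles described above can be illustrated with a minimal Python sketch. This is not SLearn's actual implementation: the function names (`sampling_estimate`, `history_estimate`, `relative_error`) and all runtime values are hypothetical, chosen only to show the idea of predicting a job's mean task runtime from a few of its own tasks rather than from previously completed jobs.

```python
import random
import statistics


def sampling_estimate(task_runtimes, num_samples, rng):
    """Estimate a job's mean task runtime from a few of its own tasks.

    Stands in for "learning in space": a small sample of the job's tasks
    is run first, and their runtimes are used to predict the rest.
    """
    k = min(num_samples, len(task_runtimes))
    sampled = rng.sample(task_runtimes, k)
    return statistics.mean(sampled)


def history_estimate(past_job_means):
    """Estimate a job's mean task runtime from earlier jobs' means.

    Stands in for "learning in time": the prediction relies on how
    previous jobs behaved.
    """
    return statistics.mean(past_job_means)


def relative_error(estimate, actual):
    """Absolute estimation error as a fraction of the true value."""
    return abs(estimate - actual) / actual


if __name__ == "__main__":
    rng = random.Random(0)

    # Hypothetical numbers, invented for illustration only.
    # Tasks within the current job have similar runtimes (seconds).
    current_job_tasks = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
    # Earlier jobs ran under different conditions, so their means differ.
    past_job_means = [4.0, 5.0, 6.0]

    actual = statistics.mean(current_job_tasks)
    by_sampling = sampling_estimate(current_job_tasks, num_samples=2, rng=rng)
    by_history = history_estimate(past_job_means)

    print("sampling-based error:", relative_error(by_sampling, actual))
    print("history-based error: ", relative_error(by_history, actual))
```

In this toy setup the sampled estimate stays close to the true mean because the job's tasks resemble one another, while the history-based estimate is off because the earlier jobs do not resemble the current one. A real scheduler would also have to account for the delay of running the sampled tasks first, which this sketch ignores.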

Syllabus

Authors Introduction
Challenges in Cluster Scheduling
Learning Runtime Properties for Cluster Scheduling
Widely Used Approach for Learning: History-based Learning
History-based Learning: Assumptions and Reality
Poor Performance of the State-of-the-Art History-based Predictor
SLearn: A Novel Approach for Learning Runtime Properties
Learning in Time (History) vs. Learning in Space (SLearn)
Comparing Prediction Accuracy: Large Scale Trace-based Analysis
Comparing Coefficients of Variation (CoVs) across Space and Time
Varying the History Length in CoV Comparison
Comparing Prediction Overhead: Simulation and Testbed Experiments Using GS
SLearn's Implementation and Design
Baselines and Experimental Setup
Simulation and Testbed Experimental Results
SLearn for DAG and Future Work
SLearn Summary


Taught by

USENIX
