YoVDO

KungFu - Making Training in Distributed Machine Learning Adaptive

Offered By: USENIX via YouTube

Tags

OSDI (Operating Systems Design and Implementation) Courses
Distributed Machine Learning Courses

Course Description

Overview

This OSDI '20 conference talk introduces KungFu, a distributed machine learning library for TensorFlow designed to make training adaptive. Distributed ML systems expose many parameters that users must configure, and KungFu addresses this with high-level Adaptation Policies (APs) that adjust hyper-parameters and system parameters during training based on metrics monitored in real time. The talk covers how monitoring and control operators are embedded in the dataflow graph, how an asynchronous collective communication layer provides concurrency and consistency, and which mechanisms KungFu uses for distributed parameter adaptation. It closes with an evaluation of how effectively KungFu adapts.
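To make the idea of an adaptation policy concrete, here is a minimal Python sketch of a monitor-then-adjust loop. It is an illustration only and does not use KungFu's API: the names `TrainingState`, `NoiseScalePolicy`, and `after_step` are hypothetical, the monitored metric is assumed to be a gradient-noise-style signal supplied by the caller, and the rule of doubling the batch size while scaling the learning rate linearly is an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingState:
    """Parameters the policy may change (hypothetical, not KungFu's types)."""
    batch_size: int
    learning_rate: float


class NoiseScalePolicy:
    """Toy policy: grow the batch size when the smoothed metric exceeds it."""

    def __init__(self, max_batch_size: int, smoothing: float = 0.9):
        self.max_batch_size = max_batch_size
        self.smoothing = smoothing
        self.avg_noise: Optional[float] = None  # moving average of the metric

    def after_step(self, state: TrainingState, noise_scale: float) -> TrainingState:
        # Monitoring: smooth the raw metric with an exponential moving average.
        if self.avg_noise is None:
            self.avg_noise = noise_scale
        else:
            self.avg_noise = (self.smoothing * self.avg_noise
                              + (1.0 - self.smoothing) * noise_scale)

        # Control: double the batch size (up to a cap) when the smoothed
        # metric is larger than the current batch size. The learning rate is
        # scaled by the same factor, which is an assumption of this sketch.
        if self.avg_noise > state.batch_size and state.batch_size < self.max_batch_size:
            new_batch = min(2 * state.batch_size, self.max_batch_size)
            factor = new_batch / state.batch_size
            return TrainingState(new_batch, state.learning_rate * factor)
        return state


if __name__ == "__main__":
    policy = NoiseScalePolicy(max_batch_size=256)
    state = TrainingState(batch_size=64, learning_rate=0.1)
    for metric in [32.0, 1000.0, 1000.0, 1000.0]:
        state = policy.after_step(state, metric)
        print(state)
```

In a real distributed setting, every worker would have to apply such a change at the same training step; that coordination problem is the subject of the talk's sections on system parameters and distributed adaptation mechanisms, and is left out of this sketch.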

Syllabus

Intro
Training in Distributed ML Systems
Parameters in Distributed ML Systems
Issues with Empirical Parameter Tuning
Proposals for Automatic Parameter Adaptation
Open Challenges
Existing Approaches for Adaptation
KungFu Overview
Adaptation Policies
Example: Adaptation Policy for GNS
Embedding Monitoring Inside Dataflow. Problem: High monitoring cost reduces adaptation benefit. Idea: Improve efficiency by adding monitoring operators to the dataflow graph
Challenges of Dataflow Collective Communication
Making Collective Communication Asynchronous. Idea: Use asynchronous collective communication
Issues When Adapting System Parameters
Distributed Mechanism for Parameter Adaptation
How Effectively Does KungFu Adapt?
Conclusions: KungFu


Taught by

USENIX

Related Courses

GraphX - Graph Processing in a Distributed Dataflow Framework
USENIX via YouTube
Theseus - An Experiment in Operating System Structure and State Management
USENIX via YouTube
RedLeaf - Isolation and Communication in a Safe Operating System
USENIX via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
Caladan - Mitigating Interference at Microsecond Timescales
USENIX via YouTube