Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Offered By: USENIX via YouTube

Tags

OSDI (Operating Systems Design and Implementation) Courses
Distributed Deep Learning Courses

Course Description

Overview

Explore an approach to automating model-parallel training of large deep learning models in this 18-minute conference talk from OSDI '22. Discover how Alpa generates execution plans that unify data, operator, and pipeline parallelism, addressing the limitations of existing model-parallel training systems. Learn how Alpa views parallelisms at two hierarchical levels, inter-operator and intra-operator parallelism, defines a new space of massive model-parallel execution plans, and applies compilation passes to derive an efficient plan at each level. Understand how Alpa's runtime orchestrates the two-level parallel execution on distributed compute devices, and examine its performance against hand-tuned systems. Gain insight into Alpa's versatility in handling models with heterogeneous architectures and models without manually designed plans. Access the source code and explore the potential of this approach for scaling out complex deep learning models on distributed clusters.
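The two-level hierarchy in the description can be sketched in plain Python. This is a toy illustration, not Alpa's actual API or search algorithm: the function names and the specific partitioning choices (contiguous layer slices for inter-operator stages, batch-dimension splitting as one intra-operator option) are illustrative assumptions.

```python
def inter_op_partition(layers, num_stages):
    """Inter-operator level: slice an ordered list of layers into
    contiguous pipeline stages (one point in the plan space)."""
    stage_size = -(-len(layers) // num_stages)  # ceiling division
    return [layers[i:i + stage_size] for i in range(0, len(layers), stage_size)]

def intra_op_shard(batch_size, devices):
    """Intra-operator level: split an operator's batch dimension evenly
    across the devices assigned to its stage (data parallelism is one
    of the intra-operator choices Alpa considers)."""
    base, rem = divmod(batch_size, len(devices))
    return {d: base + (1 if i < rem else 0) for i, d in enumerate(devices)}

# A hypothetical 6-layer model on 4 devices, 2 devices per stage.
layers = ["embed", "attn1", "mlp1", "attn2", "mlp2", "head"]
stages = inter_op_partition(layers, num_stages=2)
plan = [intra_op_shard(8, [f"gpu{2*i}", f"gpu{2*i+1}"])
        for i, _ in enumerate(stages)]
```

Alpa's contribution is searching this combined space automatically, using dynamic programming for stage slicing and integer linear programming for per-stage sharding; the sketch above only enumerates one fixed plan.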

Syllabus

OSDI '22 - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning


Taught by

USENIX

Related Courses

Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018
Open Data Science via YouTube
Efficient Distributed Deep Learning Using MXNet
Simons Institute via YouTube
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters
Databricks via YouTube
SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training
USENIX via YouTube
Horovod - Distributed Deep Learning for Reliable MLOps
Linux Foundation via YouTube