Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses
Deep Learning Courses
Performance Evaluation Courses
Supercomputing Courses
Horovod Courses

Course Description

Overview

Explore a 17-minute conference talk from USENIX NSDI '22 on accelerating collective communication in data parallel training across deep learning frameworks. Learn about new techniques developed within Horovod, a generic communication library, to improve the control plane and boost performance in large-scale distributed training. Discover how the researchers implemented a caching strategy and decentralized orchestration to streamline the coordinator-worker logic, and introduced a feature that lets users group collective operations for finer control over communication buffer sizes. Examine experimental results from the Summit supercomputer comparing the proposed strategies against Horovod's original design, tf.distribute, torch.DDP, and BytePS. Gain insight into the performance improvements achieved, including a 2x speedup at a scale of 6,000 GPUs and near-linear scaling efficiency of 0.93 with 1.54 exaflops of sustained performance on a scientific application (STEMDL) using 27,600 GPUs.
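
The grouping feature described above is exposed in Horovod's public API as grouped allreduce. As a rough illustration only (not code from the talk), the sketch below shows how a PyTorch user might batch two tensors into a single collective; it assumes Horovod 0.21 or later with PyTorch and GPU support on each worker.

```python
# Minimal sketch of Horovod's grouped allreduce (illustration only).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Two gradient-like tensors that would otherwise be reduced independently.
grads = [torch.randn(1024, device="cuda"),
         torch.randn(2048, device="cuda")]

# Submitting them as one group lets the user, rather than the autotuned
# fusion buffer alone, decide how tensors are batched into a collective,
# giving finer control over communication buffer sizes.
averaged = hvd.grouped_allreduce(grads, op=hvd.Average)

# The same idea is available during training via DistributedOptimizer's
# num_groups argument (hypothetical values shown), which splits a model's
# gradients into a fixed number of allreduce groups:
# opt = hvd.DistributedOptimizer(opt,
#                                named_parameters=model.named_parameters(),
#                                num_groups=2)
```

A script like this would typically be launched across workers with horovodrun, for example: horovodrun -np 4 python train.py.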

Syllabus

NSDI '22 - Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks


Taught by

USENIX

Related Courses

Automated Background Removal Using PyTorch - Wehkamp's Image Processing Pipeline
Databricks via YouTube
Machine Learning Using Kubeflow and Kubernetes
Devoxx via YouTube
Building Deep Learning Models on Databricks
Pluralsight
Democratizing Deep Learning at Scale with Horovod
Linux Foundation via YouTube
Efficient Data Parallel Distributed Training with Flyte, Spark and Horovod
Linux Foundation via YouTube