YoVDO

TopoOpt - Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses
Distributed Systems Courses
Algorithm Design Courses
Distributed Training Courses

Course Description

Overview

This 17-minute conference talk from NSDI '23 presents TopoOpt, a direct-connect fabric for distributed deep neural network (DNN) training that co-optimizes three dimensions: computation, communication, and network topology. The researchers demonstrate the mutability of AllReduce traffic and use that property to construct efficient network topologies for training jobs. TopoOpt combines an alternating optimization technique with a group theory-inspired algorithm called TotientPerms to find a network topology and routing plan together with a parallelization strategy. The talk also covers a fully functional 12-node direct-connect prototype with remote direct memory access (RDMA) forwarding at 100 Gbps, and large-scale simulations on real distributed training models showing that TopoOpt reduces DNN training time by up to 3.4x compared to similar-cost Fat-Tree interconnects.
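
The name TotientPerms points to Euler's totient function, which counts the integers below n that are coprime with n. As a rough illustration of why that matters for ring-shaped AllReduce traffic, the sketch below shows that hopping around n nodes with a fixed stride p forms a single ring through every node exactly when gcd(p, n) = 1. This is only the underlying number-theoretic fact, written under the assumption that it is the relevant one; it is not the algorithm presented in the talk. The function names `coprime_strides` and `ring_order` are hypothetical, and n = 12 is chosen to match the size of the prototype.

```python
from math import gcd


def coprime_strides(n):
    """Return every stride p in [1, n) with gcd(p, n) == 1 (hypothetical helper)."""
    return [p for p in range(1, n) if gcd(p, n) == 1]


def ring_order(n, p):
    """Return the nodes visited by starting at 0 and hopping +p (mod n), n times."""
    order, node = [], 0
    for _ in range(n):
        order.append(node)
        node = (node + p) % n
    return order


n = 12  # assumption: matches the 12-node prototype size mentioned above
strides = coprime_strides(n)
rings = {p: ring_order(n, p) for p in strides}

for p, ring in rings.items():
    print(f"stride {p}: {ring}")

# A stride that shares a factor with n revisits nodes instead of covering all of them.
print("stride 4 reaches only:", sorted(set(ring_order(n, 4))))
```

Each coprime stride gives a different ordering of the same nodes, so the same ring-style collective could in principle be laid out over different sets of links. Which orderings are chosen, and how they are combined with routing and the parallelization strategy, is covered in the talk itself.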

Syllabus

NSDI '23 - TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed...


Taught by

USENIX

Related Courses

Scaling Memcache at Facebook
USENIX via YouTube
Multi-Person Localization via RF Body Reflections
USENIX via YouTube
Opaque - An Oblivious and Encrypted Distributed Analytics Platform
USENIX via YouTube
Live Video Analytics at Scale with Approximation and Delay-Tolerance
USENIX via YouTube
Clipper - A Low-Latency Online Prediction Serving System
USENIX via YouTube