TopoOpt - Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
Offered By: USENIX via YouTube
Course Description
Overview
This 17-minute conference talk from NSDI '23 presents TopoOpt, a direct-connect fabric that co-optimizes computation, communication, and network topology for distributed deep neural network (DNN) training workloads. Learn how the researchers exploit the mutability of AllReduce traffic to construct efficient network topologies, and how they combine an alternating optimization technique with a group theory-inspired algorithm called TotientPerms to search for a network topology, routing plan, and parallelization strategy. The talk also covers a fully functional 12-node direct-connect prototype with remote direct memory access (RDMA) forwarding at 100 Gbps, along with large-scale simulations on real distributed training models showing that TopoOpt reduces DNN training time by up to 3.4x compared to similar-cost Fat-Tree interconnects.
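To make the "mutability of AllReduce traffic" idea concrete, here is a minimal Python sketch. It assumes the ring permutations are built by stepping through the nodes with a fixed stride that is coprime to the node count (the number of such strides is given by Euler's totient function, which is the apparent origin of the name TotientPerms). The function names `coprime_strides`, `ring_for_stride`, and `ring_edges` are hypothetical and chosen for illustration; this is not the authors' implementation, and it leaves out the topology and routing search entirely.

```python
from math import gcd


def coprime_strides(n):
    """Strides p in [1, n) with gcd(p, n) == 1; each yields a full ring over n nodes."""
    return [p for p in range(1, n) if gcd(p, n) == 1]


def ring_for_stride(n, p):
    """Order in which a ring visits nodes when starting at node 0 and stepping by p (mod n)."""
    return [(i * p) % n for i in range(n)]


def ring_edges(order):
    """Directed edges of the ring, including the wrap-around edge back to the first node."""
    n = len(order)
    return [(order[i], order[(i + 1) % n]) for i in range(n)]


if __name__ == "__main__":
    n = 12  # assumed cluster size, matching the 12-node prototype mentioned above
    for p in coprime_strides(n):
        order = ring_for_stride(n, p)
        print(f"stride {p}: {order}")
```

Each stride gives a different ring over the same nodes, so the same AllReduce can be mapped onto different sets of physical links. That freedom is what a topology search could exploit when choosing which direct connections to build.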
Syllabus
NSDI '23 - TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed...
Taught by
USENIX
Related Courses
Online Master of Computer Science (Arizona State University via Coursera)
Blockchain Scalability and its Foundations in Distributed Systems (The University of Sydney via Coursera)
Blockchain Fundamentals: Understanding the Origins, Mechanisms, and Applications of Decentralized Systems (SDA Bocconi School of Management via edX)
Blockchain Technology (University of California, Berkeley via edX)
Building Globally Distributed Databases with Cosmos DB (Coursera Project Network via Coursera)