High Performance Networking for Distributed DL Training in Production K8s
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore the intricacies of high-performance networking for distributed deep learning training in production Kubernetes environments in this 25-minute conference talk. Delve into the design and architecture of an 800-GPU cluster interconnected over a RoCE fabric, achieving line-rate performance between communicating containers in multi-node jobs. Learn about a scalable, cookie-cutter POD design for data centers; a low-latency, one-hop network design that lets NCCL rings avoid output-port congestion; and Kubernetes integration with multi-homed networks for optimal GPU utilization. Gain insights into performance numbers for training workloads from production clusters, and discover how to overcome bottlenecks at the NICs and the switching fabric that serve as the interconnect between nodes.
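One common way to realize the multi-homed network integration the talk describes is a Multus secondary network attachment plus an RDMA device plugin, with NCCL pinned to the RoCE-backed interface. A minimal sketch follows; the attachment name `roce-net`, HCA name `mlx5_1`, interface `net1`, and resource name `rdma/roce_gdr` are illustrative assumptions, not details from the talk:

```yaml
# Hypothetical pod spec: resource and network names are cluster-specific,
# not taken from the talk.
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker-0
  annotations:
    # Multus attaches a secondary RoCE-backed interface alongside the default network
    k8s.v1.cni.cncf.io/networks: roce-net
spec:
  containers:
  - name: trainer
    image: example.registry/dl-trainer:latest   # placeholder image
    env:
    # Pin NCCL to the RDMA-capable HCA/interface so rings run over the RoCE fabric
    - name: NCCL_IB_HCA
      value: mlx5_1
    - name: NCCL_SOCKET_IFNAME
      value: net1
    resources:
      limits:
        nvidia.com/gpu: "8"
        rdma/roce_gdr: "1"   # RDMA device-plugin resource; name varies by cluster
```

With this shape, each multi-node job's pods get a dedicated high-bandwidth path separate from the cluster's default network, which is the prerequisite for the line-rate container-to-container performance discussed in the talk.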
Syllabus
High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018 - Open Data Science via YouTube
Efficient Distributed Deep Learning Using MXNet - Simons Institute via YouTube
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters - Databricks via YouTube
SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training - USENIX via YouTube
Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning - USENIX via YouTube