YoVDO

High Performance Networking for Distributed DL Training in Production K8s

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Conference Talks Courses, Kubernetes Courses, Cluster Architecture Courses, Distributed Deep Learning Courses

Course Description

Overview

This 25-minute conference talk covers high-performance networking for distributed deep learning training in production Kubernetes environments. It describes the design and architecture of an 800-GPU cluster interconnected over a RoCE fabric that achieves line-rate performance between communicating containers in multi-node jobs. Topics include a scalable, cookie-cutter POD design for data centers, a low-latency, one-hop network design that lets NCCL rings avoid output port congestion, and Kubernetes integration with multi-homed networks for optimal GPU utilization. The talk also presents performance numbers for training workloads from production clusters and explains how to overcome bottlenecks at the NIC and in the switching fabric that interconnect the nodes.

Syllabus

High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath


Taught by

CNCF [Cloud Native Computing Foundation]
