Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Conference Talks Courses
Deep Learning Courses
Kubernetes Courses
PyTorch Courses
MPI Courses
GPU Acceleration Courses
RDMA Courses
Horovod Courses

Course Description

Overview

Explore production multi-node job execution with gang scheduling, Kubernetes, GPUs, and RDMA in this conference talk from KubeCon + CloudNativeCon. Dive into the challenges and solutions for running distributed deep learning and machine learning workloads in shared Kubernetes clusters. Learn about distributed TensorFlow, PyTorch, Horovod, and MPI implementations, as well as the use of GPU nodes with NCCL and RDMA for accelerated performance. Discover the end-to-end flow for multi-node jobs in Kubernetes, including gang scheduling, quotas, fairness, and backfilling implemented in a custom GPU scheduler. Gain insights into high-speed networking through RoCE and SR-IOV/Multus CNI, and understand design choices, learnings, and operational experiences, including failure handling, performance optimization, and telemetry in large-scale distributed computing environments.

Syllabus

Intro
Deep Learning Applications
AI/DL: Models, Frameworks, Hardware
Trends: Big Data, Larger Models
Sample Multi-GPU Node: DGX-1
Distributed Training Applications: Multi-GPU, Multi-node
K8s Challenges & Outline
K8s Orchestration Flow
Sample PyTorch Job Launch
Array Jobs and MPI Operator
SR-IOV CNI for K8s Multi-Rail
Gang Scheduling Multi-Node Pods
PodGroup Queue and Manager
Demo
Sample Job Real-Time Telemetry
Sample BERT K8s Scaling
Shared K8s Cluster for Multi-node
Scheduler Dashboard
Summary and Future Work


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Windows Server 2019: Advanced Networking Features
LinkedIn Learning
Deep Dive into GPU Support in Apache Spark 3.x - Accelerator-Aware Scheduling and RAPIDS Plugin
Databricks via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
An Edge-Queued Datagram Service for All Datacenter Traffic
USENIX via YouTube
Building a High Performance Network in the Public Cloud Using RDMA - First Principles
Oracle via YouTube