YoVDO

Practice of Building AI Training Clusters Based on Kubernetes and RoCEv2

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
High Performance Computing Courses
GPU Computing Courses
Network Virtualization Courses

Course Description

Overview

Explore the practice of building AI training clusters with Kubernetes and RoCEv2 in this 42-minute conference talk. Learn how to integrate RoCEv2 lossless networks into Kubernetes, use RoCEv2 networks from Kubernetes pods, optimize resource scheduling for nodes with multiple GPUs and RoCE network cards, and adjust AI training tasks accordingly. Discover solutions to challenges such as network card virtualization, RoCE lossless network configuration, and running training tasks on RoCEv2 and Kubernetes. Gain insight into the advantages of RoCEv2 networks over traditional InfiniBand networks for AI cluster construction, and into strategies for implementing them.

Syllabus

Practice of Building AI Training Cluster Based on Kubernetes+RoCEv2 - Wang DeKui & Wang Chao IEI


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

High Performance Computing
Georgia Institute of Technology via Udacity
Introduction to Parallel Programming Using OpenMP and MPI
Tomsk State University via Coursera
High Performance Computing in the Cloud
Dublin City University via FutureLearn
Production Machine Learning Systems
Google Cloud via Coursera
LAFF-On Programming for High Performance
The University of Texas at Austin via edX