Scale and Accelerate Distributed Model Training in Kubernetes Clusters

Offered By: MLOps World: Machine Learning in Production via YouTube

Tags

Kubernetes Courses Machine Learning Courses PyTorch Courses MLOps Courses GPU Computing Courses Distributed Computing Courses Kubeflow Courses RDMA Courses

Course Description

Overview

Explore how to scale and accelerate distributed model training in Kubernetes clusters in this 49-minute conference talk from MLOps World: Machine Learning in Production. Learn from Jack Jin, Lead ML Infrastructure Engineer at Zoom, as he shares insights on orchestrating deep learning workloads across multiple GPUs and nodes. Discover how Kubernetes and the Kubeflow PyTorchJob operator can be leveraged to schedule and track distributed training jobs on multi-GPU single-node and multi-GPU multi-node setups within a shared GPU resource pool. Gain knowledge about how Zoom accelerates deep learning training with RDMA and RoCE, which bypass the kernel's TCP/IP stack and offload protocol processing to the network hardware. Understand how these technologies are applied in Kubernetes using SR-IOV via the NVIDIA Network Operator in heterogeneous GPU clusters, and learn how to achieve near-linear performance increases as GPU counts and worker nodes scale up.
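As a sketch of the scheduling approach the talk describes, a Kubeflow PyTorchJob manifest for a multi-node, multi-GPU training job might look like the following. The job name, container image, launch command, replica counts, and especially the `rdma/roce` device-plugin resource name are illustrative assumptions, not details from the talk:

```yaml
# Hypothetical PyTorchJob: 1 master + 3 workers, 8 GPUs each (32 GPUs total).
# The Kubeflow training operator creates the pods and wires up the
# MASTER_ADDR/RANK environment needed for torch.distributed.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-example          # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest          # placeholder image
              command: ["torchrun", "--nproc_per_node=8", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/roce: 1   # assumed resource name for an SR-IOV RDMA VF
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest          # placeholder image
              command: ["torchrun", "--nproc_per_node=8", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/roce: 1   # assumed resource name for an SR-IOV RDMA VF
```

Requesting the RDMA device alongside the GPUs lets NCCL use the RoCE path for inter-node gradient exchange, which is what enables the near-linear scaling mentioned above.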

Syllabus

Scale and Accelerate Distributed Model Training in Kubernetes Clusters


Taught by

MLOps World: Machine Learning in Production

Related Courses

Introduction to Cloud Infrastructure Technologies
Linux Foundation via edX
Scalable Microservices with Kubernetes
Google via Udacity
Google Cloud Fundamentals: Core Infrastructure
Google via Coursera
Introduction to Kubernetes
Linux Foundation via edX
Fundamentals of Containers, Kubernetes, and Red Hat OpenShift
Red Hat via edX