Training Foundation Model Workloads on Kubernetes at Scale with MCAD

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
Apache Spark Courses
PyTorch Courses
GPU Computing Courses
Distributed Computing Courses
Foundation Models Courses
Cloud Native Computing Courses

Course Description

Overview

Explore how IBM Research built Vela, a cloud-native AI supercomputer, to train foundation models on Kubernetes at scale. Learn about the challenges of supporting multiple frameworks, such as PyTorch, Ray, and Spark, for diverse research teams. Discover how the Multi-Cluster App Dispatcher (MCAD) queues custom resources for large-scale AI training and how it interacts with the underlying Kubernetes scheduler. Gain insight into how gang priority, gang preemption, and fault tolerance are implemented for training jobs that span hundreds of GPUs and run for extended periods. This conference talk is aimed at researchers and developers who work with foundation models and need to scale AI workloads on Kubernetes.

Syllabus

Training Foundation Model Workloads on Kubernetes at Scale W... Abhishek Malvankar & Olivier Tardieu


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Моделирование биологических молекул на GPU (Biomolecular modeling on GPU)
Moscow Institute of Physics and Technology via Coursera
Practical Deep Learning For Coders
fast.ai via Independent
GPU Architectures And Programming
Indian Institute of Technology, Kharagpur via Swayam
Perform Real-Time Object Detection with YOLOv3
Coursera Project Network via Coursera
Getting Started with PyTorch
Coursera Project Network via Coursera