YoVDO

Advanced Resource Management for Running AI/ML Workloads with Kueue

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
GPU Computing Courses
Kueue Courses

Course Description

Overview

This conference talk covers advanced resource management for AI/ML workloads with Kueue. It walks through Kueue's architecture and shows how to implement quota- and priority-based resource sharing among multiple teams on Kubernetes, including how Kueue's scheduler decides when to start and stop jobs. A real-world production use case from CyberAgent illustrates Kueue as a key component of a multi-tenant system serving engineers and ML research teams, where it manages different job types and ML frameworks across multiple CPU and GPU configurations. The talk also addresses ML training jobs that can only run once all of their pods are scheduled, and presents solutions using Kueue in both static and autoscaling environments with the new ProvisioningRequest API.
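
To make the idea of quota- and priority-based, all-or-nothing admission more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: it does not use Kueue's API and does not reproduce Kueue's actual scheduling algorithm. The `Job` class, the `admit` function, the team names, and the quota numbers are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description; not a Kueue or Kubernetes type."""
    name: str
    team: str
    priority: int      # higher value is considered first
    pods: int          # number of pods the job needs
    gpus_per_pod: int

    @property
    def gpus(self) -> int:
        # Total GPUs needed for ALL pods of the job to run together.
        return self.pods * self.gpus_per_pod


def admit(jobs, quota):
    """Return the names of admitted jobs for a per-team GPU quota.

    Toy rule (an assumption, not Kueue's real logic): jobs are considered
    in descending priority order, and a job is admitted only if all of its
    pods fit in its team's remaining quota (all-or-nothing).
    """
    remaining = dict(quota)
    admitted = []
    # sorted() is stable, so equal-priority jobs keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.gpus <= remaining.get(job.team, 0):
            remaining[job.team] -= job.gpus
            admitted.append(job.name)
        # Otherwise the job stays queued; no partial set of pods is started.
    return admitted


jobs = [
    Job("train-a", "research", priority=1, pods=4, gpus_per_pod=2),  # needs 8 GPUs
    Job("train-b", "research", priority=5, pods=2, gpus_per_pod=2),  # needs 4 GPUs
    Job("batch-c", "platform", priority=1, pods=3, gpus_per_pod=1),  # needs 3 GPUs
]
quota = {"research": 8, "platform": 2}

print(admit(jobs, quota))
```

In this toy run, the higher-priority `train-b` takes 4 of the research team's 8 GPUs, so `train-a` (8 GPUs) no longer fits and waits as a whole, and `batch-c` waits because it exceeds the platform team's quota. Features discussed in the talk, such as stopping running jobs or sharing resources between teams, are deliberately left out of this sketch.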

Syllabus

Advanced Resource Management for Running AI/ML Workloads with Kueue - Michał Woźniak, Yuki Iwai


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Моделирование биологических молекул на GPU (Biomolecular modeling on GPU)
Moscow Institute of Physics and Technology via Coursera
LLM Server
Pragmatic AI Labs via edX
AI Infrastructure and Operations Fundamentals
Nvidia via Coursera
Open Source LLMOps Solutions
Duke University via Coursera
Deep Learning - Computer Vision for Beginners Using PyTorch
Packt via Coursera