Advanced Resource Management for Running AI/ML Workloads with Kueue
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore advanced resource management techniques for AI/ML workloads using Kueue in this informative conference talk. Dive into Kueue's architecture and learn how to implement quota- and priority-based resource sharing among multiple teams on Kubernetes. Understand the decision-making process behind Kueue's scheduler for starting and stopping jobs. Gain insights from a real-world production use case at CyberAgent, where Kueue serves as a crucial component in a multi-tenant system supporting various engineers and ML research teams. Discover how Kueue manages different job types and ML frameworks across multiple CPU and GPU configurations. Address the challenge of running ML training jobs requiring all pods to be scheduled, and explore solutions using Kueue in both static and autoscaling environments with the new ProvisioningRequest API.
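As a rough illustration of the quota- and priority-based start/stop decisions mentioned above, the sketch below models a single team's queue in plain Python. It is a toy model and not Kueue's actual algorithm or API: the names `Job`, `TeamQueue`, and `try_admit`, the GPU-only quota, and the "preempt lowest priority first" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    gpus: int      # GPUs requested by the whole job (hypothetical unit)
    priority: int  # higher value means more important


@dataclass
class TeamQueue:
    quota: int  # GPUs this team may use (assumed, fixed quota)
    running: list = field(default_factory=list)

    def used(self) -> int:
        return sum(j.gpus for j in self.running)


def try_admit(queue: TeamQueue, job: Job):
    """Return the jobs stopped to make room, or None if the job must wait."""
    free = queue.quota - queue.used()
    victims = []
    # Walk running jobs from lowest priority upward as preemption candidates.
    for candidate in sorted(queue.running, key=lambda j: j.priority):
        if free >= job.gpus:
            break  # enough room already
        if candidate.priority >= job.priority:
            break  # only strictly lower-priority jobs may be stopped
        victims.append(candidate)
        free += candidate.gpus
    if free < job.gpus:
        return None  # does not fit even after preemption; nothing is stopped
    for v in victims:
        queue.running.remove(v)
    queue.running.append(job)  # the job starts only if all its GPUs fit
    return victims
```

In this model a job is admitted as a whole or not at all, which loosely mirrors the "all pods must be scheduled" requirement for ML training jobs described in the talk summary; a real deployment would rely on Kueue's own queueing and preemption configuration instead.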
Syllabus
Advanced Resource Management for Running AI/ML Workloads with Kueue - Michał Woźniak, Yuki Iwai
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
SIG Scheduling Deep Dive in Kubernetes - Latest Enhancements and Opportunities
CNCF [Cloud Native Computing Foundation] via YouTube
Kubernetes WG Batch: Recent Improvements and Future Roadmap
CNCF [Cloud Native Computing Foundation] via YouTube
Building a Batch System for the Cloud with Kueue
CNCF [Cloud Native Computing Foundation] via YouTube
Kueue: Kubernetes-Native Job Queueing for Batch Workloads
CNCF [Cloud Native Computing Foundation] via YouTube
Sailing Ray Workloads with KubeRay and Kueue in Kubernetes
CNCF [Cloud Native Computing Foundation] via YouTube