On-Demand Systems and Scaled Training Using the JobSet API
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore the JobSet API for orchestrating complex workflows in ephemeral environments in this conference talk. Discover how to use it to manage large-scale machine learning model training and to build on-demand HPC systems. Learn how to automate the setup of training workloads with common frameworks such as PyTorch, and see results from large-scale experiments using thousands of TPU chips. Gain insight into streamlining the creation of on-demand HPC systems and establishing standardized environments for experimental comparisons. Understand how the JobSet API addresses job orchestration challenges, providing scalability and high resource utilization for heterogeneous components in cloud-native computing environments.
Syllabus
On-Demand Systems and Scaled Training Using the JobSet API - Abdullah Gharaibeh & Vanessa Sochat
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Production Machine Learning Systems (Google Cloud via Coursera)
Deep Learning (Kaggle via YouTube)
All About AI Accelerators - GPU, TPU, Dataflow, Near-Memory, Optical, Neuromorphic & More (Yannic Kilcher via YouTube)
Machine Learning with JAX - From Hero to HeroPro+ (Aleksa Gordić - The AI Epiphany via YouTube)
PyTorch NLP Model Training and Fine-Tuning on Colab TPU Multi-GPU with Accelerate (1littlecoder via YouTube)