Training Large Language Models on Kubernetes
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
This conference talk covers the challenges and best practices of training Large Language Models (LLMs) on Kubernetes (K8s). Topics include optimizing networking, managing distributed resources, scheduling effectively, and adapting training code to run on K8s. The talk presents pre-made configurations, data pre-processing workflows, and training setups based on NVIDIA's Megatron Transformer framework, intended to help you start LLM training on Kubernetes quickly. It also compares training throughput between bare-metal and K8s-based environments for models such as GPT, T5, and BERT across various GPU node configurations. Along the way, it explains the massive computational requirements of LLMs and how Kubernetes can be used to train them, as an alternative to the traditional approach of bare-metal servers managed by high-performance computing workload schedulers such as Slurm.
Syllabus
Training Large Language Models on Kubernetes - Ronen Dar, Run:ai
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
High Performance Computing - Georgia Institute of Technology via Udacity
Introduction to Parallel Programming Using OpenMP and MPI (Введение в параллельное программирование с использованием OpenMP и MPI) - Tomsk State University via Coursera
High Performance Computing in the Cloud - Dublin City University via FutureLearn
Production Machine Learning Systems - Google Cloud via Coursera
LAFF-On Programming for High Performance - The University of Texas at Austin via edX