Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore strategies for optimizing Large Language Model (LLM) workflows using smart infrastructure enhanced by Volcano in this informative conference talk. Discover how to effectively manage large-scale LLM training and inference platforms while addressing critical challenges such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling. Learn about fault detection techniques, fast job recovery, and self-healing mechanisms that significantly improve efficiency. Gain insights into handling long downtime in LLM training on heterogeneous GPUs, implementing intelligent GPU workload scheduling to reduce resource fragmentation and costs, and leveraging topology-aware scheduling on rack/supernode systems to accelerate LLM training. Benefit from real-world experiences shared by the speakers in managing thousands of GPUs and handling monthly workloads involving numerous LLM training and inference jobs in a cloud-native AI platform environment.
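As a rough illustration of the scheduling and self-healing ideas the talk covers, a minimal Volcano Job manifest for a multi-pod training run might look like the sketch below. It uses Volcano's real `batch.volcano.sh/v1alpha1` API (`minAvailable` for gang scheduling, `policies` for restart-on-eviction); the image name, replica count, and GPU sizes are hypothetical.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-train
spec:
  schedulerName: volcano
  minAvailable: 4            # gang scheduling: start only when all 4 workers can be placed
  queue: default
  policies:
    - event: PodEvicted
      action: RestartJob     # simple self-healing: restart the whole job on eviction
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/llm-trainer:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 8    # hypothetical per-pod GPU request
```

Gang scheduling of this kind avoids the resource fragmentation mentioned in the overview: partial placements that would leave GPUs idle waiting for stragglers are never started.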
Syllabus
Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano - Xin Li & Xuzheng Chang
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
MongoDB for DBAs (MongoDB University)
MongoDB Advanced Deployment and Operations (MongoDB University)
Building Cloud Apps with Microsoft Azure - Part 3 (Microsoft via edX)
Implementing Microsoft Windows Server Disks and Volumes (Microsoft via edX)
Cloud Computing and Distributed Systems (Indian Institute of Technology Patna via Swayam)