
Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
High Performance Computing Courses
Distributed Training Courses
NUMA Courses

Course Description

Overview

Explore how to leverage topology modeling and topology-aware scheduling to accelerate Large Language Model (LLM) training in this 45-minute conference talk by William Wang of Huawei at CNCF. Delve into the shift from compute bottlenecks to network bottlenecks in the era of LLM training and inference, examining the high-throughput, low-latency interconnect technologies, such as NVLink and NVSwitch, used in hyper-computers. Analyze how inter-node communication and intra-node resource interconnects affect AI workload performance, particularly for large language model training. Learn how to model topology across underlying resources such as NUMA domains, Racks, Super Pods, and Hyper Computers. Discover techniques for making schedulers topology-aware to optimize resource allocation and performance. Investigate methods to coordinate topology-aware scheduling with Dynamic Resource Allocation (DRA) on nodes, addressing Kubernetes' current limitations in handling topology awareness efficiently for AI workloads.
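To make the core idea concrete, here is a minimal Go sketch of how a scheduler plug-in might model the NUMA / Rack / Super Pod / Hyper Computer hierarchy described in the talk and score candidate placements by the tightest topology domain that encloses all of a job's pods. This is an illustration under stated assumptions, not the talk's implementation; all type and function names (TopologyLevel, Node, lowestCommonLevel, score) are hypothetical.

```go
// Sketch: score placements by the tightest common topology domain.
// All names here are hypothetical, not from the talk or Kubernetes.
package main

import "fmt"

// TopologyLevel orders interconnect domains from tightest to loosest.
type TopologyLevel int

const (
	NUMA          TopologyLevel = iota // intra-node NUMA domain
	Rack                               // nodes sharing a rack switch
	SuperPod                           // racks sharing a spine layer
	HyperComputer                      // the full cluster fabric
)

// Node records the domain a node belongs to at each topology level.
type Node struct {
	Name    string
	Domains map[TopologyLevel]string // e.g. Rack -> "rack-1"
}

// lowestCommonLevel returns the tightest level at which every node in
// the placement shares a domain. Distinct nodes never share a NUMA
// domain, so multi-node placements resolve at Rack level or looser.
func lowestCommonLevel(nodes []Node) TopologyLevel {
	for _, level := range []TopologyLevel{NUMA, Rack, SuperPod} {
		shared := true
		for _, n := range nodes[1:] {
			if n.Domains[level] != nodes[0].Domains[level] {
				shared = false
				break
			}
		}
		if shared {
			return level
		}
	}
	return HyperComputer
}

// score favors placements that collapse into a tighter common domain:
// the tighter the level, the higher the score.
func score(placement []Node) int {
	return int(HyperComputer - lowestCommonLevel(placement))
}

func main() {
	a := Node{"node-a", map[TopologyLevel]string{NUMA: "a-numa0", Rack: "rack-1", SuperPod: "pod-1"}}
	b := Node{"node-b", map[TopologyLevel]string{NUMA: "b-numa0", Rack: "rack-1", SuperPod: "pod-1"}}
	c := Node{"node-c", map[TopologyLevel]string{NUMA: "c-numa0", Rack: "rack-7", SuperPod: "pod-2"}}

	fmt.Println("same-rack placement:", score([]Node{a, b})) // 2: tight, preferred
	fmt.Println("cross-pod placement:", score([]Node{a, c})) // 0: loosest common domain
}
```

The design mirrors the talk's motivation: the tighter the common domain, the lower the expected inter-pod communication latency, so a topology-aware scheduler should prefer placements that fit within a rack before spilling across super pods or the full fabric.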

Syllabus

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training - William Wang


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Hyper-V on Windows Server 2016 and Windows 10
Udemy
Performance Tuning Red Hat Enterprise Linux Platform for Databases
Red Hat via YouTube
Java 9 - The Quest for Very Large Heaps
Devoxx via YouTube
Under the Hood of a Shard-per-Core Database Architecture
Linux Foundation via YouTube
Comparing Performance of NVMe Hard Drives in KVM, Baremetal, and Docker - Using Fio and SPDK
Linux Foundation via YouTube