ML Training Acceleration with Heterogeneous Resources in ByteDance
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore machine learning training acceleration with heterogeneous resources at ByteDance in this 19-minute conference talk from KubeCon + CloudNativeCon Europe 2022. Delve into strategies for maximizing GPU utilization through sharing mechanisms, optimizing resource allocation with NUMA affinity, and achieving high-throughput network communication using RDMA CNI and Intel SR-IOV technology. Gain insights into empowering model training, improving performance for large-scale distributed models, and effectively managing diverse CPU and GPU resources. Topics covered include GPU offline training (network and scheduling), GPU online serving, unified GPU scheduling, and future work.
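To make the ingredients above concrete, here is a minimal sketch (not taken from the talk) of a Kubernetes pod request that combines them: a shared-GPU resource, an SR-IOV RDMA secondary network attached via the standard Multus annotation, and Guaranteed QoS so a kubelet Topology Manager policy such as single-numa-node can keep the GPU, the NIC, and the CPUs on the same NUMA node. The resource names example.com/shared-gpu and example.com/sriov-rdma-vf, the network attachment sriov-rdma, and the image are illustrative placeholders, not ByteDance's actual names.

# Sketch: pod requesting a shared GPU plus an SR-IOV RDMA VF,
# built with the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="trainer-worker-0",
        annotations={
            # Multus attaches the SR-IOV RDMA VF as a secondary interface.
            "k8s.v1.cni.cncf.io/networks": "sriov-rdma",  # placeholder attachment name
        },
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/train:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    # Specifying only limits makes requests default to them,
                    # giving Guaranteed QoS, which lets the Topology Manager
                    # enforce NUMA alignment across the listed devices.
                    limits={
                        "example.com/shared-gpu": "1",     # hypothetical GPU-sharing device plugin
                        "example.com/sriov-rdma-vf": "1",  # hypothetical SR-IOV VF resource
                        "cpu": "8",
                        "memory": "32Gi",
                    },
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

In this setup the scheduling and the networking concerns the talk separates are visible in one spec: the device-plugin resources drive placement, while the CNI annotation drives the RDMA data path.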
Syllabus
Intro
GPU Offline Training (Network)
GPU Offline Training (Scheduling)
GPU Online Serving
GPU Unified Scheduling
Future Work
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Future of Computing - IBM Power 9 and beyond
openHPI
SIGCOMM 2020 - Reducto - On-Camera Filtering for Resource-Efficient Real-Time Video Analytics
Association for Computing Machinery (ACM) via YouTube
Offload Annotations - Bringing Heterogeneous Computing to Existing Libraries and Workloads
USENIX via YouTube
Supercomputing Spotlights - Supercomputing Software for Moore and Beyond
Society for Industrial and Applied Mathematics via YouTube
Liquid Metal - Taming Heterogeneity
GOTO Conferences via YouTube