YoVDO

ML Training Acceleration with Heterogeneous Resources in ByteDance

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Machine Learning Courses
Kubernetes Courses
Distributed Systems Courses
GPU Acceleration Courses
Heterogeneous Computing Courses
RDMA Courses

Course Description

Overview

Explore machine learning training acceleration techniques using heterogeneous resources at ByteDance in this 19-minute conference talk from KubeCon + CloudNativeCon Europe 2022. Delve into strategies for maximizing GPU utilization through sharing mechanisms, optimizing resource allocation with NUMA affinity, and implementing high-throughput network communication using an RDMA CNI and Intel SR-IOV technology. Gain insights into empowering model training, enhancing performance for large-scale distributed models, and effectively managing diverse CPU/GPU resources. Topics covered include GPU offline training (network and scheduling), GPU online serving, unified GPU scheduling, and future developments in the field.
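To make the RDMA/SR-IOV and GPU allocation ideas concrete, the sketch below shows how a Kubernetes training pod might request GPUs alongside an SR-IOV virtual function on an RDMA-capable secondary network. This is an illustrative assumption based on common device-plugin and Multus CNI conventions, not ByteDance's actual configuration; the network attachment name, image, and SR-IOV resource name are hypothetical.

```yaml
# Hypothetical training-worker pod; resource names and the network
# attachment are assumed for illustration, not taken from the talk.
apiVersion: v1
kind: Pod
metadata:
  name: dist-training-worker
  annotations:
    # Attach a secondary RDMA-capable interface via a Multus-managed
    # SR-IOV/RDMA CNI network (attachment name assumed).
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net
spec:
  containers:
  - name: trainer
    image: example.com/ml-trainer:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: "4"                # GPUs via the NVIDIA device plugin
        intel.com/sriov_rdma_vf: "1"       # SR-IOV VF (resource name assumed)
```

The NUMA-affinity aspect mentioned in the talk corresponds, in upstream Kubernetes, to the kubelet Topology Manager (e.g. the `single-numa-node` policy), which can align GPU and NIC allocations to the same NUMA node.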

Syllabus

Intro
GPU Offline Training (Network)
GPU Offline Training (Scheduling)
GPU Online Serving
GPU Unified Scheduling
Future Work


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Windows Server 2019: Advanced Networking Features
LinkedIn Learning
Deep Dive into GPU Support in Apache Spark 3.x - Accelerator-Aware Scheduling and RAPIDS Plugin
Databricks via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
An Edge-Queued Datagram Service for All Datacenter Traffic
USENIX via YouTube
Building a High Performance Network in the Public Cloud Using RDMA - First Principles
Oracle via YouTube