Detecting and Overcoming GPU Failures During ML Training

Offered By: Linux Foundation via YouTube

Tags

Machine Learning Courses, Cloud Computing Courses, Fault Tolerance Courses, Observability Courses, Distributed Training Courses

Course Description

Overview

This 43-minute conference talk by Ganeshkumar Ashokavardhanan (Microsoft) and Sarah Belghiti (Wayve) covers strategies for detecting and overcoming GPU failures during machine learning training. It examines the challenges GPU failures pose for ML training, particularly distributed training, as model sizes and training scales increase. It surveys the spectrum of GPU issues and explains why even minor performance drops can significantly impact large jobs. It covers how observability tools such as NVIDIA DCGM support proactive problem detection through GPU health checks, and outlines the principles of fault-tolerant distributed training that mitigate the impact of GPU failures. Drawing on experience from a cloud provider and an autonomous vehicle company, the speakers share best practices for efficiently identifying, remediating, and preventing GPU failures. The talk also explores emerging ideas for improving training resilience and efficiency, such as CRIU (Checkpoint/Restore in Userspace) and task pre-emption for GPU workloads.
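The description's point that small per-GPU problems matter at scale can be illustrated with a little arithmetic. The sketch below is not taken from the talk: the function names are hypothetical, the failure rate and step times are made-up illustrative numbers, and it assumes GPUs fail independently and that training is synchronous (every step waits for the slowest worker).

```python
def job_failure_probability(per_gpu_failure_prob: float, num_gpus: int) -> float:
    """Probability that at least one GPU fails, assuming independent failures.

    per_gpu_failure_prob is an assumed, illustrative per-GPU failure
    probability over some time window (for example, one day).
    """
    return 1.0 - (1.0 - per_gpu_failure_prob) ** num_gpus


def synchronous_step_time(per_gpu_step_times: list[float]) -> float:
    """Step time of a synchronous job: all workers wait for the slowest one."""
    return max(per_gpu_step_times)


if __name__ == "__main__":
    # Illustrative numbers only: a 0.1% per-GPU failure chance per window.
    for n in (8, 128, 1024):
        p = job_failure_probability(0.001, n)
        print(f"{n:5d} GPUs: P(at least one failure) = {p:.3f}")

    # One GPU running 20% slower drags the whole synchronous step with it.
    healthy = [1.0] * 1023
    degraded = healthy + [1.2]
    print("step time with one slow GPU:", synchronous_step_time(degraded))
```

Under these assumptions, a failure rate that is negligible for a single GPU becomes likely across a large job, and a single degraded GPU sets the pace for every other worker. That is the motivation for the proactive health checks and fault-tolerant training mentioned in the description.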

Syllabus

Detecting & Overcoming GPU Failures During ML Training - Ganeshkumar Ashokavardhanan & Sarah Belghiti


Taught by

Linux Foundation

Related Courses

Software as a Service
University of California, Berkeley via Coursera
Software Defined Networking
Georgia Institute of Technology via Coursera
Pattern-Oriented Software Architectures: Programming Mobile Services for Android Handheld Systems
Vanderbilt University via Coursera
Web-Technologien
openHPI
Données et services numériques, dans le nuage et ailleurs
Certificat informatique et internet via France Université Numerique