YoVDO

Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer

Offered By: USENIX via YouTube

Tags

Machine Learning Courses
Fault Tolerance Courses
High Performance Computing Courses
Supercomputers Courses

Course Description

Overview

Explore the intricacies of managing Google's TPUv4 Machine Learning Supercomputer in this 18-minute conference talk from NSDI '24. Delve into the design and operation of the software infrastructure that enables TPUv4 supercomputers to function at scale, with a focus on automatic fault resiliency and hardware recovery features. Learn about the software-defined networking (SDN) approach used to manage the high-bandwidth inter-chip interconnect (ICI) fabric, including the use of optical circuit switching for dynamic route configuration to circumvent machine, chip, and link failures. Discover how the infrastructure detects failures, triggers automatic reconfigurations to minimize workload disruption, and initiates remediation and repair workflows for affected components. Gain insights into how similar techniques interface with maintenance and upgrade workflows for both hardware and software. Understand how this dynamic reconfiguration approach allows TPUv4 supercomputers to achieve 99.98% system availability, effectively handling hardware outages experienced by approximately 1% of training jobs.

Syllabus

NSDI '24 - Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer


Taught by

USENIX

Related Courses

High Performance Computing
Georgia Institute of Technology via Udacity
Introduction to Parallel Programming Using OpenMP and MPI
Tomsk State University via Coursera
High Performance Computing in the Cloud
Dublin City University via FutureLearn
Production Machine Learning Systems
Google Cloud via Coursera
LAFF-On Programming for High Performance
The University of Texas at Austin via edX