Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer
Offered By: USENIX via YouTube
Course Description
Overview
Explore the intricacies of managing Google's TPUv4 Machine Learning Supercomputer in this 18-minute conference talk from NSDI '24. Delve into the design and operation of the software infrastructure that enables TPUv4 supercomputers to function at scale, with a focus on automatic fault resiliency and hardware recovery features. Learn about the software-defined networking (SDN) approach used to manage the high-bandwidth inter-chip interconnect (ICI) fabric, including the use of optical circuit switching for dynamic route configuration to circumvent machine, chip, and link failures. Discover how the infrastructure detects failures, triggers automatic reconfigurations to minimize workload disruption, and initiates remediation and repair workflows for affected components. Gain insights into how similar techniques interface with maintenance and upgrade workflows for both hardware and software. Understand how this dynamic reconfiguration approach allows TPUv4 supercomputers to achieve 99.98% system availability, effectively handling hardware outages experienced by approximately 1% of training jobs.
Syllabus
NSDI '24 - Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
Taught by
USENIX
Related Courses
High Performance Computing (Georgia Institute of Technology via Udacity)
Introduction to Parallel Programming Using OpenMP and MPI (Tomsk State University via Coursera)
High Performance Computing in the Cloud (Dublin City University via FutureLearn)
Production Machine Learning Systems (Google Cloud via Coursera)
LAFF-On Programming for High Performance (The University of Texas at Austin via edX)