Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer

Offered By: USENIX via YouTube

Course Description

Overview


Explore how Google manages its TPUv4 machine learning supercomputers in this 18-minute conference talk from NSDI '24. Learn about the design and operation of the software infrastructure that lets TPUv4 supercomputers run at scale, with a focus on automatic fault resiliency and hardware recovery. See how a software-defined networking (SDN) approach manages the high-bandwidth inter-chip interconnect (ICI) fabric, using optical circuit switching to reconfigure routes dynamically around machine, chip, and link failures. Discover how the infrastructure detects failures, automatically triggers reconfiguration to minimize workload disruption, and initiates remediation and repair workflows for the affected components. Gain insight into how similar techniques interface with maintenance and upgrade workflows for both hardware and software. Understand how this dynamic reconfiguration approach allows TPUv4 supercomputers to achieve 99.98% system availability while handling the hardware outages experienced by approximately 1% of training jobs.
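To make the detect-then-reconfigure idea concrete, here is a minimal toy sketch in Python. It is not Google's implementation and does not model the ICI fabric or optical circuit switches. It only assumes that a job's logical slots are mapped to physical units through a routing table, and that a failed unit can be swapped for a healthy spare by rewriting that table. The names Fabric, schedule, and report_failure are hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Toy model: logical job slots mapped to physical units by a routing table."""

    healthy: set[str]                                      # units currently usable
    routes: dict[int, str] = field(default_factory=dict)   # logical slot -> unit

    def spares(self) -> list[str]:
        # Healthy units that no logical slot is currently routed to.
        return sorted(self.healthy - set(self.routes.values()))

    def schedule(self, slots: int) -> None:
        # Route each logical slot to a distinct healthy unit.
        free = self.spares()
        if len(free) < slots:
            raise RuntimeError("not enough healthy units")
        for slot in range(slots):
            self.routes[slot] = free[slot]

    def report_failure(self, unit: str) -> list[int]:
        """Mark a unit as failed and reroute affected slots to spares.

        Returns the logical slots that were rerouted.
        """
        self.healthy.discard(unit)
        affected = [s for s, u in self.routes.items() if u == unit]
        for slot in affected:
            free = self.spares()
            if not free:
                raise RuntimeError("no spare unit available")
            self.routes[slot] = free[0]
        return affected


fabric = Fabric(healthy={"u0", "u1", "u2", "u3"})
fabric.schedule(3)                     # slots 0..2 land on u0, u1, u2
moved = fabric.report_failure("u1")    # slot 1 is rerouted to the spare u3
print(moved, fabric.routes)
```

In this sketch the job keeps its logical shape and only the routing table changes, which is the general idea behind reconfiguring around a failure rather than restarting on a different set of hardware. A real system would also have to checkpoint and resume the workload and hand the failed unit to a repair workflow.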

Syllabus

NSDI '24 - Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

Taught by

USENIX
