
SuperBench - Improving Cloud AI Infrastructure Reliability with Proactive Validation

Offered By: USENIX via YouTube

Tags

Cloud Computing Courses, Artificial Intelligence Courses, Benchmarking Courses

Course Description

Overview

This conference talk presents SuperBench, a proactive validation system for improving the reliability of cloud AI infrastructure. SuperBench is designed to mitigate the hidden degradation, or "gray failures", caused by hardware redundancies in cloud AI environments. It includes a comprehensive benchmark suite that evaluates individual hardware components and represents real AI workloads. Its Validator component uses machine learning to identify defective components, while its Selector decides when to run validation and which benchmarks to include. Testbed evaluations and simulations show that SuperBench significantly increases the mean time between incidents. The system has been deployed in Azure production, where it validated hundreds of thousands of GPUs over a two-year period. The talk also explains why gray failures matter for cloud service providers and how proactive validation improves overall reliability.

Syllabus

USENIX ATC '24 - SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation


Taught by

USENIX

Related Courses

Software as a Service
University of California, Berkeley via Coursera
Software Defined Networking
Georgia Institute of Technology via Coursera
Pattern-Oriented Software Architectures: Programming Mobile Services for Android Handheld Systems
Vanderbilt University via Coursera
Web-Technologien
openHPI
Données et services numériques, dans le nuage et ailleurs
Certificat informatique et internet via France Université Numerique