YoVDO

Building a 5-Exaflop Supercomputer for Meta-AI Research and Large-Scale Model Training

Offered By: USENIX via YouTube

Tags

High Performance Computing Courses
Distributed Systems Courses
GPU Computing Courses
Supercomputers Courses
InfiniBand Courses

Course Description

Overview

Explore the architecture, construction, and operation of Meta's cutting-edge AI Research SuperCluster in this 35-minute conference talk from SREcon23 Europe/Middle East/Africa. Discover how a small, geographically distributed team of Software and Production Engineers (SRE) collaborated to build and manage one of the world's largest AI supercomputers, boasting 16,000 GPUs and 5 exaflops of compute power. Gain insights into the challenges and solutions involved in supporting large-scale model training, including the recently released Llama series, with a focus on the InfiniBand interconnect, high-performance storage systems, and emerging monitoring and observability needs. Learn valuable lessons from Meta's experience in pushing the boundaries of AI research infrastructure and team collaboration.
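As a back-of-envelope check on the figures quoted above, dividing 5 exaflops across 16,000 GPUs gives the implied per-GPU peak. This is an illustrative sketch, not a calculation from the talk; the observation that the result lines up with an NVIDIA A100's 312 TFLOPS dense BF16 tensor-core peak is an assumption about how the aggregate figure is quoted.

```python
# Illustrative arithmetic only; the precision/GPU-model interpretation is an
# assumption, not a claim from the talk.
total_flops = 5e18        # 5 exaflops of aggregate compute
num_gpus = 16_000         # GPU count cited for the cluster
per_gpu_tflops = total_flops / num_gpus / 1e12

print(f"{per_gpu_tflops:.1f} TFLOPS per GPU")  # → 312.5 TFLOPS
```

312.5 TFLOPS per GPU matches the A100's dense BF16 tensor-core peak, which suggests the 5-exaflop figure is an aggregate mixed-precision number rather than FP64.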

Syllabus

SREcon23 Europe/Middle East/Africa - Building a 5-Exaflop Supercomputer for Meta-AI Research and...


Taught by

USENIX

Related Courses

The World of 100G Networking
Linux Foundation via YouTube
The Fundamentals of RDMA Programming
Nvidia via Coursera
Chameleon: Expanding Open-Source Ambari for HPC
Linux Foundation via YouTube
Serverless Kubernetes Boosts AI Business
CNCF [Cloud Native Computing Foundation] via YouTube
A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network
USENIX via YouTube