Building a 5-Exaflop Supercomputer for Meta AI Research and Large-Scale Model Training

Offered By: USENIX via YouTube

Tags

High Performance Computing Courses
Distributed Systems Courses
GPU Computing Courses
Supercomputers Courses
InfiniBand Courses

Course Description

Overview

Explore the architecture, construction, and operation of Meta's cutting-edge AI Research SuperCluster in this 35-minute conference talk from SREcon23 Europe/Middle East/Africa. Discover how a small, geographically distributed team of Software and Production Engineers (SRE) collaborated to build and manage one of the world's largest AI supercomputers, boasting 16,000 GPUs and 5 exaflops of compute power. Gain insights into the challenges and solutions involved in supporting large-scale model training, including the recently released Llama series, with a focus on the InfiniBand interconnect, high-performance storage systems, and emerging monitoring and observability needs. Learn valuable lessons from Meta's experience in pushing the boundaries of AI research infrastructure and team collaboration.
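The headline figures above can be sanity-checked with simple arithmetic. The sketch below assumes the 16,000 GPUs are NVIDIA A100s (which Meta has publicly said the Research SuperCluster uses) and applies the A100's roughly 312 TFLOPS dense BF16 tensor-core peak; neither assumption is stated in this listing itself.

```python
# Back-of-envelope check of the "5 exaflops" claim.
# Assumption: 16,000 NVIDIA A100 GPUs at ~312 TFLOPS peak each
# (dense BF16 tensor-core throughput, per NVIDIA's published specs).
NUM_GPUS = 16_000
PEAK_TFLOPS_PER_GPU = 312

# Convert teraflops (10^12) to exaflops (10^18): divide by 10^6.
total_exaflops = NUM_GPUS * PEAK_TFLOPS_PER_GPU / 1_000_000
print(f"Aggregate peak: {total_exaflops:.3f} exaflops")  # → Aggregate peak: 4.992 exaflops
```

At roughly 4.99 exaflops of aggregate peak mixed-precision compute, the numbers line up with the "5 exaflops" headline; real training jobs sustain only a fraction of this peak.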

Syllabus

SREcon23 Europe/Middle East/Africa - Building a 5-Exaflop Supercomputer for Meta AI Research and...


Taught by

USENIX

Related Courses

Biomolecular Modeling on GPU (Моделирование биологических молекул на GPU)
Moscow Institute of Physics and Technology via Coursera
Practical Deep Learning For Coders
fast.ai via Independent
GPU Architectures And Programming
Indian Institute of Technology, Kharagpur via Swayam
Perform Real-Time Object Detection with YOLOv3
Coursera Project Network via Coursera
Getting Started with PyTorch
Coursera Project Network via Coursera