PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

Offered By: MLOps World: Machine Learning in Production via YouTube

Course Description

Overview

Explore PANAMA, an in-network aggregation framework for distributed machine learning training on shared clusters. The system has two key components: a custom in-network hardware accelerator that supports floating-point gradient aggregation at line rate without compromising accuracy, and a lightweight load-balancing and congestion control protocol. Discover how PANAMA exploits the unique communication patterns of data-parallel ML jobs to enable fair sharing of network resources while ensuring high throughput for long-running jobs and low latency for short jobs and latency-sensitive traffic. Examine the feasibility of PANAMA through an FPGA-based prototype with 10 Gbps transceivers and large-scale simulations. Learn how the framework reduces the average training time of large jobs by up to 1.34x and cuts the 99th-percentile completion time of non-aggregation flows by up to 4.5x. Nadeen Gebara, a Ph.D. student at Imperial College London, presents this research on machine learning infrastructure.
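To make the idea of gradient aggregation concrete, here is a minimal Python sketch of what an aggregation point computes in data-parallel training: the element-wise sum of the gradient vectors sent by the workers, which is then averaged and applied by every worker. This is an illustration only, not PANAMA's hardware design or protocol; the function names `aggregate_gradients` and `averaged_gradient` are hypothetical and chosen for this example.

```python
from typing import List


def aggregate_gradients(worker_gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of per-worker gradient vectors.

    Hypothetical stand-in for the reduction an aggregating network
    element would perform on gradient packets as they pass through.
    """
    if not worker_gradients:
        raise ValueError("need at least one worker gradient")
    length = len(worker_gradients[0])
    if any(len(grad) != length for grad in worker_gradients):
        raise ValueError("all gradient vectors must have the same length")
    # zip(*...) groups the i-th element of every worker's vector together.
    return [sum(values) for values in zip(*worker_gradients)]


def averaged_gradient(worker_gradients: List[List[float]]) -> List[float]:
    """Average gradient that each worker applies after aggregation."""
    total = aggregate_gradients(worker_gradients)
    num_workers = len(worker_gradients)
    return [value / num_workers for value in total]


if __name__ == "__main__":
    # Two workers, each holding a two-element gradient vector.
    gradients = [[1.0, 2.0], [3.0, 4.0]]
    print(aggregate_gradients(gradients))
    print(averaged_gradient(gradients))
```

Performing this sum inside the network means the receiving end gets one aggregated vector instead of one vector per worker, which is the traffic reduction that in-network aggregation aims for.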

Syllabus

PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

Taught by

MLOps World: Machine Learning in Production
