YoVDO

PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

Offered By: MLOps World: Machine Learning in Production via YouTube

Tags

Machine Learning Courses
FPGA Courses
Congestion Control Courses
Distributed Computing Courses
Load Balancing Courses

Course Description

Overview

Explore PANAMA, an in-network aggregation framework designed for distributed machine learning training on shared clusters. Delve into the system's two key components: a custom in-network hardware accelerator that aggregates floating-point gradients at line rate without compromising accuracy, and a lightweight load-balancing and congestion control protocol. Discover how PANAMA exploits the communication patterns unique to data-parallel ML jobs to share network resources fairly while sustaining high throughput for long-running jobs and low latency for short jobs and latency-sensitive traffic. Examine the feasibility of PANAMA through an FPGA-based prototype with 10 Gbps transceivers and large-scale simulations. Learn how the framework decreases the average training time of large jobs by up to a factor of 1.34 and benefits non-aggregation flows by reducing their 99th-percentile completion time by up to 4.5x. Nadeen Gebara, a Ph.D. student at Imperial College London, presents this research on machine learning infrastructure optimization.
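
For readers new to the idea, the following is a minimal conceptual sketch of what aggregating gradients inside the network means for a data-parallel job. It is not PANAMA's hardware design or protocol. The function names `aggregate_gradients` and `packets_at_receiver`, and the simple packet-count model, are assumptions made purely for illustration.

```python
from typing import List


def aggregate_gradients(worker_gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors, one per worker.

    Hypothetical stand-in for the reduction an in-network aggregator
    would perform on gradient packets passing through it.
    """
    if not worker_gradients:
        raise ValueError("need at least one worker gradient")
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all gradients must have the same length")
    # zip(*...) groups the i-th element of every worker's gradient together.
    return [sum(values) for values in zip(*worker_gradients)]


def packets_at_receiver(num_workers: int, num_chunks: int, in_network: bool) -> int:
    """Gradient packets reaching the receiving end under a simplified model.

    Assumption: without in-network aggregation every worker's chunk
    travels all the way to the receiver; with it, the network forwards
    one aggregated packet per chunk.
    """
    return num_chunks if in_network else num_workers * num_chunks


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
    print(aggregate_gradients(grads))
    print(packets_at_receiver(num_workers=4, num_chunks=10, in_network=False))
    print(packets_at_receiver(num_workers=4, num_chunks=10, in_network=True))
```

Under this simplified model, the traffic arriving at the receiver no longer grows with the number of workers once aggregation happens in the network, which is the general motivation for in-network aggregation.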

Syllabus

PANAMA: In-Network Aggregation for Shared Machine Learning Clusters


Taught by

MLOps World: Machine Learning in Production

Related Courses

Designing Highly Scalable Web Apps on Google Cloud Platform
Google via Coursera
Google Cloud Platform for AWS Professionals
Google via Coursera
Elastic Google Cloud Infrastructure: Scaling and Automation
Google Cloud via Coursera
Windows Server 2016: Advanced Virtualization
Microsoft via edX
Elastic Cloud Infrastructure: Scaling and Automation 日本語版
Google Cloud via Coursera