
PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

Offered By: MLOps World: Machine Learning in Production via YouTube

Tags

Machine Learning Courses, FPGA Courses, Congestion Control Courses, Distributed Computing Courses, Load Balancing Courses

Course Description

Overview

Explore PANAMA, an in-network aggregation framework designed for distributed machine learning training on shared clusters. The system has two key components: a custom in-network hardware accelerator that aggregates floating-point gradients at line rate without compromising accuracy, and a lightweight load-balancing and congestion control protocol. Discover how PANAMA exploits the distinctive communication patterns of data-parallel ML jobs to share network resources fairly while maintaining high throughput for long-running jobs and low latency for short jobs and latency-sensitive traffic. Examine the feasibility of PANAMA through an FPGA-based prototype with 10 Gbps transceivers and through large-scale simulations. Learn how the framework reduces the average training time of large jobs by up to a factor of 1.34 and benefits non-aggregation flows by reducing their 99th-percentile completion time by up to 4.5x. Nadeen Gebara, a Ph.D. student at Imperial College London, presents this research on machine learning infrastructure optimization.
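
To make the idea of gradient aggregation concrete, here is a minimal software sketch of what an aggregation point computes in data-parallel training: the element-wise sum of the workers' gradient vectors, which each worker then divides by the number of workers. This is only an illustration under stated assumptions; the function names `aggregate_in_network` and `data_parallel_step` are hypothetical, and the code does not represent PANAMA's hardware design, packet format, or protocol.

```python
from typing import List


def aggregate_in_network(worker_gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors.

    Hypothetical stand-in for the sum an aggregation device would
    compute over the gradients it receives from each worker.
    """
    if not worker_gradients:
        raise ValueError("need at least one worker")
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all gradient vectors must have the same length")
    # zip(*...) groups the i-th element of every worker's vector together.
    return [sum(values) for values in zip(*worker_gradients)]


def data_parallel_step(worker_gradients: List[List[float]]) -> List[float]:
    """Average gradient that every worker would apply after aggregation."""
    total = aggregate_in_network(worker_gradients)
    n = len(worker_gradients)
    return [t / n for t in total]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two parameters each
    print(aggregate_in_network(grads))
    print(data_parallel_step(grads))
```

Summing inside the network means each worker sends its gradients once and receives one aggregated result, rather than exchanging gradients with every other worker.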

Syllabus

PANAMA: In-Network Aggregation for Shared Machine Learning Clusters


Taught by

MLOps World: Machine Learning in Production

Related Courses

Computer Networking
Georgia Institute of Technology via Udacity
Cloud Networking
University of Illinois at Urbana-Champaign via Coursera
Packet Switching Networks and Algorithms
University of Colorado System via Coursera
TCP/IP and Advanced Topics
University of Colorado System via Coursera
Master Class : TCP/IP Mechanics from Scratch to Expert
Udemy