PANAMA: In-Network Aggregation for Shared Machine Learning Clusters
Offered By: MLOps World: Machine Learning in Production via YouTube
Course Description
Overview
Explore PANAMA, an in-network aggregation framework designed for distributed machine learning training on shared clusters. Examine the two key components of the system: a custom in-network hardware accelerator that supports floating-point gradient aggregation at line rate without compromising accuracy, and a lightweight load-balancing and congestion control protocol. Discover how the protocol exploits the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources while ensuring high throughput for long-running jobs and low latency for short jobs and latency-sensitive traffic. Assess the feasibility of PANAMA through an FPGA-based prototype with 10 Gbps transceivers and large-scale simulations. Learn how the framework reduces the average training time of large jobs by up to a factor of 1.34 and benefits non-aggregation flows by reducing their 99th-percentile completion time by up to 4.5x. Gain insights from Nadeen Gebara, a Ph.D. student at Imperial College London, as she presents this research on machine learning infrastructure optimization.
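To make the idea of gradient aggregation concrete, the following is a minimal software sketch of what an aggregation point computes in data-parallel training: it receives the same slice of the gradient vector from every worker and returns the element-wise sum. This is an illustration only, not PANAMA's hardware design or protocol. The function names, the `chunk_size` parameter, and the sample gradients are all hypothetical.

```python
from typing import List


def aggregate_in_network(worker_gradients: List[List[float]],
                         chunk_size: int = 4) -> List[float]:
    """Sum gradients element-wise, one chunk ("packet") at a time.

    Hypothetical model: each worker sends its gradient in fixed-size
    chunks, and the aggregation point adds matching chunks together.
    """
    if not worker_gradients:
        raise ValueError("need at least one worker")
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all workers must send gradients of equal length")

    aggregated: List[float] = []
    for start in range(0, length, chunk_size):
        # One packet per worker, all covering the same slice of the gradient.
        packets = [g[start:start + chunk_size] for g in worker_gradients]
        # Element-wise floating-point sum across workers.
        aggregated.extend(sum(values) for values in zip(*packets))
    return aggregated


def average_gradient(worker_gradients: List[List[float]]) -> List[float]:
    """Divide the aggregated sum by the worker count, as a worker might do."""
    total = aggregate_in_network(worker_gradients)
    n = len(worker_gradients)
    return [value / n for value in total]


# Made-up gradients from three workers (values chosen for illustration).
grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
    [1.5, 1.5, 1.5, 1.5, 1.5],
]
print(aggregate_in_network(grads))  # [3.0, 4.0, 5.0, 6.0, 7.0]
```

In this sketch each worker sends its gradient once and receives one aggregated result, which is the general motivation for aggregating inside the network rather than exchanging gradients between workers.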
Syllabus
PANAMA: In-Network Aggregation for Shared Machine Learning Clusters
Taught by
MLOps World: Machine Learning in Production
Related Courses
Cloud Computing Concepts, Part 1 (University of Illinois at Urbana-Champaign via Coursera)
Cloud Computing Concepts: Part 2 (University of Illinois at Urbana-Champaign via Coursera)
Reliable Distributed Algorithms - Part 1 (KTH Royal Institute of Technology via edX)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)
Réalisez des calculs distribués sur des données massives (CentraleSupélec via OpenClassrooms)