Bagua - Lightweight Distributed Learning on Kubernetes
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore a conference talk on Bagua, a lightweight distributed learning framework for Kubernetes developed by Kuaishou Technology and ETH Zürich. Discover how Bagua supports high-performance distributed deep learning without requiring special network devices or restrictive scheduling. Learn about its communication algorithms and its seamless integration with Kubernetes, which together enable horizontal scaling of training with excellent speedup guarantees over ordinary Ethernet connections. Examine Bagua's effectiveness across a range of scenarios and models, including ResNet on ImageNet, BERT-Large, and large-scale industrial applications at Kuaishou. Gain insights into its performance advantages: in production Kubernetes clusters, Bagua outperforms PyTorch-DDP, Horovod, and BytePS in end-to-end training time by up to 1.95 times. Understand how Bagua addresses the challenges of training recommendation models with massive parameter counts, video and image understanding models with billions of samples, and automatic speech recognition (ASR) models with terabyte-scale datasets.
Syllabus
Bagua: Lightweight Distributed Learning on Kubernetes - Xiangru Lian & Xianghong Li, Kuaishou
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018 (Open Data Science via YouTube)
Efficient Distributed Deep Learning Using MXNet (Simons Institute via YouTube)
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters (Databricks via YouTube)
SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training (USENIX via YouTube)
Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (USENIX via YouTube)