Bamboo - Making Preemptible Instances Resilient for Affordable Training of Large DNNs

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses
Machine Learning Courses
Distributed Systems Courses
Cost Optimization Courses
Deep Neural Networks Courses

Course Description

Overview

Explore a 15-minute conference talk from USENIX NSDI '23 that introduces Bamboo, a distributed system designed to significantly reduce the cost of training large Deep Neural Network (DNN) models. Learn how Bamboo runs training on preemptible instances and adds redundant computations to the training pipeline so that training stays resilient and efficient despite frequent preemptions. Discover how this approach achieves a 3.7× improvement in training throughput over traditional checkpointing techniques and a 2.4× reduction in costs compared to using on-demand instances. Gain insights into the challenges of training increasingly large DNN models and the solutions proposed to make this process more affordable for organizations and research labs of all sizes.
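
To make the idea of redundant computation in a training pipeline more concrete, here is a minimal toy sketch in Python. It is not Bamboo's implementation. It assumes that each pipeline stage keeps a replica of its successor's layers (wrapping around at the end), so that a surviving predecessor can take over when a single stage is preempted. The names Stage, build_pipeline, preempt, and execution_plan are hypothetical and were chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One pipeline stage: its own layers plus a replica of its successor's layers."""
    own_layers: list
    shadow_layers: list = field(default_factory=list)
    alive: bool = True


def build_pipeline(num_layers, num_stages):
    """Split layers evenly across stages; each stage shadows its successor.

    Assumption for this toy model: the last stage shadows the first one.
    """
    if num_layers % num_stages != 0:
        raise ValueError("num_layers must be divisible by num_stages")
    per_stage = num_layers // num_stages
    stages = [
        Stage(list(range(i * per_stage, (i + 1) * per_stage)))
        for i in range(num_stages)
    ]
    for i, stage in enumerate(stages):
        stage.shadow_layers = list(stages[(i + 1) % num_stages].own_layers)
    return stages


def preempt(stages, index):
    """Mark a stage as preempted (the instance was reclaimed)."""
    stages[index].alive = False


def execution_plan(stages):
    """Return (stage_index, layers) pairs for the surviving stages.

    A surviving stage also runs its shadow layers when its successor is gone.
    If some layer has no surviving owner or shadow, raise an error; a real
    system would need another recovery path at that point, such as a checkpoint.
    """
    plan = []
    n = len(stages)
    for i, stage in enumerate(stages):
        if not stage.alive:
            continue
        layers = list(stage.own_layers)
        if not stages[(i + 1) % n].alive:
            layers += stage.shadow_layers
        plan.append((i, layers))
    covered = sorted(layer for _, layers in plan for layer in layers)
    expected = sorted(layer for s in stages for layer in s.own_layers)
    if covered != expected:
        raise RuntimeError("consecutive stages lost; redundancy is not enough")
    return plan


if __name__ == "__main__":
    pipeline = build_pipeline(num_layers=8, num_stages=4)
    preempt(pipeline, 2)  # stage 2 is reclaimed by the cloud provider
    print(execution_plan(pipeline))
```

In this toy model, losing stage 2 means stage 1 runs layers 2 to 5 while the other stages are unaffected, so no restart from a checkpoint is needed. Losing two adjacent stages exhausts the redundancy, which the sketch reports as an error.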

Syllabus

NSDI '23 - Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs


Taught by

USENIX

Related Courses

Introduction to Artificial Intelligence
Stanford University via Udacity
Natural Language Processing
Columbia University via Coursera
Probabilistic Graphical Models 1: Representation
Stanford University via Coursera
Computer Vision: The Fundamentals
University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course)
California Institute of Technology via Independent