Check-N-Run - A Checkpointing System for Training Deep Learning Recommendation Models

Offered By: USENIX via YouTube

Tags

USENIX Symposium on Networked Systems Design and Implementation (NSDI) Courses
Machine Learning Courses
Quantization Courses
High Performance Computing Courses
Recommendation Systems Courses

Course Description

Overview

Explore Check-N-Run, a checkpointing system for training large-scale deep learning recommendation models, in this NSDI '22 conference talk. Examine the challenges of checkpointing massive ML models and see how Check-N-Run addresses checkpoint size and write-bandwidth constraints. Learn about differential checkpointing, which tracks and saves only the modified parts of the model and is particularly effective for recommendation models with embedding tables. Examine quantization strategies that significantly reduce checkpoint size without compromising training accuracy. Understand how these techniques substantially reduce the required write bandwidth and storage capacity, improving checkpointing capability while lowering total cost of ownership. Gain insights into the architecture of recommendation models, high-performance training at Meta, and the critical role of checkpointing in failure recovery and in continuous learning for online training.
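
The idea of saving only the modified parts of a model can be illustrated with a small sketch. This is not Check-N-Run's implementation; the class `TrackedEmbeddingTable`, its methods, and the `restore` helper are hypothetical names chosen for illustration, and the table is a toy list of Python lists. The sketch assumes that each training step touches only a few embedding rows, so recording the touched row indices is enough to build a small delta.

```python
class TrackedEmbeddingTable:
    """Toy embedding table that records which rows were modified (hypothetical)."""

    def __init__(self, num_rows, dim):
        self.rows = [[0.0] * dim for _ in range(num_rows)]
        self.dirty = set()  # row indices updated since the last checkpoint

    def update(self, row_id, grad, lr=0.1):
        # Sparse SGD step: only the accessed row changes.
        self.rows[row_id] = [w - lr * g for w, g in zip(self.rows[row_id], grad)]
        self.dirty.add(row_id)

    def full_checkpoint(self):
        # Copy every row and reset the modification tracking.
        self.dirty.clear()
        return {i: list(r) for i, r in enumerate(self.rows)}

    def differential_checkpoint(self):
        # Copy only the rows touched since the previous checkpoint.
        delta = {i: list(self.rows[i]) for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(baseline, deltas):
    """Rebuild table state by replaying deltas, in order, over a full baseline."""
    state = {i: list(r) for i, r in baseline.items()}
    for delta in deltas:
        state.update({i: list(r) for i, r in delta.items()})
    return state


table = TrackedEmbeddingTable(num_rows=1000, dim=4)
baseline = table.full_checkpoint()
table.update(3, [1.0, 0.0, 0.0, 0.0])
table.update(42, [0.0, 2.0, 0.0, 0.0])
delta = table.differential_checkpoint()
print(len(baseline), len(delta))  # 1000 2
```

Here the modified set is cleared after every checkpoint, so each delta is relative to the previous checkpoint and recovery must replay all deltas in order. Tracking changes against the last full checkpoint instead would make recovery simpler at the cost of larger deltas; the sketch does not claim to match any specific policy from the talk.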

Syllabus

Intro
Recommendation Models are important • Use cases include
Recommendation Model Architecture
High Performance Training at Meta
The Criticality of Checkpointing • Failure recovery ensures progress
Checkpoint Challenges
Check-N-Run
Checkpointing Workflow
Reducing Write Bandwidth (WB) with Differential Checkpointing
Approaches for Differential Checkpointing • One-Shot Differential Checkpoint • Consecutive Incremental Checkpoint • Intermittent Differential Checkpoint
Checkpoint Quantization • Compress checkpoint without degrading training accuracy
Comparing Quantization Strategies • Uniform quantization • Non-uniform quantization using k-means • Adaptive uniform quantization
Quantization Bit-width Selection
Overall Reduction
Summary
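
The quantization items in the syllabus above can be made concrete with a generic sketch of uniform quantization and of how bit-width affects reconstruction error. This is a textbook min/max scheme applied to one row of values, not necessarily the scheme presented in the talk; the function names and the sample row are assumptions made for illustration.

```python
def quantize_uniform(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using the row's min and max."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_uniform(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest per-element error after a quantize/dequantize round trip."""
    codes, lo, scale = quantize_uniform(values, bits)
    restored = dequantize_uniform(codes, lo, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


row = [-0.8, -0.1, 0.0, 0.25, 0.9]  # made-up embedding row
for bits in (2, 4, 8):
    print(bits, max_abs_error(row, bits))
```

Each value is stored in `bits` bits plus a per-row offset and scale, and the round-trip error is bounded by half the step size, so fewer bits give a smaller checkpoint at the cost of a coarser reconstruction. Choosing the bit-width is therefore a trade-off between checkpoint size and how much error training can tolerate after a restore.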


Taught by

USENIX

Related Courses

Introduction to Artificial Intelligence
Stanford University via Udacity
Natural Language Processing
Columbia University via Coursera
Probabilistic Graphical Models 1: Representation
Stanford University via Coursera
Computer Vision: The Fundamentals
University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course)
California Institute of Technology via Independent