SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training
Offered By: USENIX via YouTube
Course Description
Overview
Explore a groundbreaking approach to optimizing distributed deep learning training (DLT) in this conference talk from FAST '23. Dive into SHADE, a DLT-aware caching system that addresses the I/O performance bottleneck in accelerator-driven training environments. Learn how SHADE uses importance sampling to detect fine-grained, per-sample variations in importance and make informed caching decisions for distributed DLT jobs. Discover its rank-based approach, which captures the relative importance of samples across different minibatches and dynamically updates importance scores during training. Examine the significant improvements in cache hit ratio and overall training performance that SHADE achieves, particularly for computer vision models. Gain insights into the challenges posed by exponentially growing dataset sizes and the distinct I/O workload behavior of DLT applications, and understand how SHADE's techniques can reshape storage system design for deep learning.
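To make the caching idea above concrete, the short Python sketch below illustrates importance-aware cache admission and eviction under stated assumptions: it is not SHADE's actual implementation, and the class name, the use of per-sample loss as the importance score, and the evict-lowest-score policy are illustrative choices only. A fixed-capacity cache keeps the highest-scoring samples, and the training loop is assumed to refresh scores as they change between epochs.

```python
from typing import Dict, Optional


class ImportanceAwareCache:
    """Toy importance-aware sample cache: keep the highest-scoring samples (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.store: Dict[int, bytes] = {}        # sample_id -> cached sample bytes
        self.importance: Dict[int, float] = {}   # sample_id -> latest importance score

    def get(self, sample_id: int) -> Optional[bytes]:
        # Returns None on a cache miss; the caller would then read from backing storage.
        return self.store.get(sample_id)

    def update_importance(self, sample_id: int, score: float) -> None:
        # Called after a training step, e.g., with the sample's most recent loss (an assumption here).
        if sample_id in self.importance:
            self.importance[sample_id] = score

    def put(self, sample_id: int, data: bytes, score: float) -> None:
        if sample_id in self.store:
            self.importance[sample_id] = score
            return
        if len(self.store) >= self.capacity:
            # Evict the currently least-important cached sample.
            victim = min(self.importance, key=self.importance.get)
            if score <= self.importance[victim]:
                return  # the new sample ranks below everything cached; skip admission
            del self.store[victim]
            del self.importance[victim]
        self.store[sample_id] = data
        self.importance[sample_id] = score


# Example: sample 1 is evicted because it carries the lowest importance score.
cache = ImportanceAwareCache(capacity=2)
cache.put(0, b"img0", score=0.9)
cache.put(1, b"img1", score=0.1)
cache.put(2, b"img2", score=0.5)
assert cache.get(1) is None and cache.get(2) == b"img2"
```

The key design point the sketch tries to capture is that admission and eviction are driven by relative importance rather than recency, so frequently revisited but low-value samples do not crowd out the samples that contribute most to training.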
Syllabus
FAST '23 - SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Taught by
USENIX
Related Courses
Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018 (Open Data Science via YouTube)
Efficient Distributed Deep Learning Using MXNet (Simons Institute via YouTube)
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters (Databricks via YouTube)
Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (USENIX via YouTube)
Horovod - Distributed Deep Learning for Reliable MLOps (Linux Foundation via YouTube)