SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training
Offered By: USENIX via YouTube
Course Description
Overview
This conference talk from FAST '23 presents an approach to optimizing distributed deep learning training (DLT). It introduces SHADE, a DLT-aware caching system that addresses the I/O performance bottleneck in accelerator-driven environments. Learn how SHADE builds on importance sampling, detecting fine-grained importance variations at the per-sample level and using them to make informed caching decisions for distributed DLT jobs. Discover the rank-based approach that captures the relative importance of samples across different minibatches and dynamically updates importance scores during training. Examine the improvements in cache hit ratio and overall training performance that SHADE achieves, particularly for computer vision models. Gain insight into the challenges posed by exponentially growing dataset sizes and the unique I/O workload behaviors of DLT applications, and see how SHADE's techniques can inform storage system design for deep learning.
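To make the rank-based idea concrete, here is a minimal Python sketch of an importance-aware cache. It is an illustration under stated assumptions, not SHADE's actual implementation: the per-sample training loss is assumed as the importance signal, ranks are normalized within each minibatch so that scores from different minibatches are comparable, and the names rank_scores and ImportanceCache are hypothetical.

```python
import numpy as np


def rank_scores(losses):
    """Map per-sample losses from one minibatch to rank scores in (0, 1].

    Assumption: a higher loss means a more important sample.
    """
    losses = np.asarray(losses, dtype=float)
    order = np.argsort(losses)                 # ascending: lowest loss first
    ranks = np.empty(len(losses), dtype=int)
    ranks[order] = np.arange(1, len(losses) + 1)
    return ranks / len(losses)                 # highest loss gets score 1.0


class ImportanceCache:
    """Hypothetical cache that keeps the samples with the highest scores."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}     # sample_id -> most recent rank score
        self.cached = set()  # sample_ids currently held in the cache

    def update(self, sample_ids, losses):
        """Refresh scores after a minibatch and reconsider cache contents."""
        for sid, score in zip(sample_ids, rank_scores(losses)):
            self.scores[sid] = float(score)
            self._admit(sid)

    def _admit(self, sid):
        if sid in self.cached:
            return
        if len(self.cached) < self.capacity:
            self.cached.add(sid)
            return
        # Evict the least important cached sample only if the newcomer
        # has a strictly higher score.
        victim = min(self.cached, key=self.scores.__getitem__)
        if self.scores[sid] > self.scores[victim]:
            self.cached.remove(victim)
            self.cached.add(sid)


cache = ImportanceCache(capacity=2)
cache.update([0, 1, 2, 3], [0.1, 0.9, 0.5, 0.7])
print(sorted(cache.cached))
```

Because the scores are ranks rather than raw losses, a sample that tops a later minibatch can displace a cached sample even if the absolute loss values have drifted during training.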
Syllabus
FAST '23 - SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Taught by
USENIX
Related Courses
Understanding the Robustness of SSDs under Power Fault
USENIX via YouTube
BetrFS - A Right-Optimized Write-Optimized File System
USENIX via YouTube
F2FS - A New File System for Flash Storage
USENIX via YouTube
DNA Data Storage and Near-Molecule Processing for the Yottabyte Era
USENIX via YouTube
FAST '21 Work-in-Progress Reports
USENIX via YouTube