Scaling ML Workloads with PyTorch
Offered By: Microsoft via YouTube
Course Description
Overview
Syllabus
Introduction.
Why is large model training needed?.
Scaling creates training and model efficiency.
Larger models = more efficient, less training, less data.
Larger models can learn with few-shot learning.
Democratizing large-scale language models with OPT-175B.
Challenges of large model training.
What is PyTorch Distributed?.
Features Overview.
DistributedDataParallel (see the DDP sketch after the syllabus).
FullyShardedDataParallel.
FSDP Auto wrapping.
FSDP Auto wrapping example (see the combined FSDP sketch after the syllabus).
FSDP CPU Offload, Backward Prefetch policies.
FSDP Mixed Precision control.
Pipeline.
Example Auto Partitioning.
Pipeline + DDP (PDP).
Memory Saving Features.
Activation Checkpointing (see the sketch after the syllabus).
Activation Offloading.
Activation Checkpointing & Offloading.
Parameter Offloading.
Memory Saving Feature & Training Paradigms.
Experiments & Insights.
Model Implementation.
Scaling Efficiency Varying # GPUs.
Scaling Efficiency Varying World Size.
Scaling Efficiency Varying Batch Size.
Model Scale Limit.
Impact of Network Bandwidth.
Best Practices.
Best Practices FSDP.
Profiling & Troubleshooting.
Profiling & Troubleshooting for Large Scale Model Training.
Uber Prof (Experimental) Profiling & Troubleshooting tool.
Demonstration.
Combining DCGM + Profiling.
Profiling for Large Scale Model Training.
NVIDIA Nsight Systems multi-node, multi-GPU Profiling.
PyTorch Profiler Distributed Training Profiling (single-node, multi-GPU; see the sketch after the syllabus).
Try it now.
Resources.
Closing Notes.
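The sketches below are minimal illustrations of the distributed-training features named in the syllabus; they are not code from the session, and the models, sizes, and hyperparameters are placeholder values.

A DistributedDataParallel (DDP) sketch: one process per GPU, each holding a full model replica, with gradients averaged across ranks automatically during backward().

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; requires a CUDA device per rank.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 1024).cuda(rank)      # toy stand-in model
    ddp_model = DDP(model, device_ids=[rank])     # wrap for gradient sync
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):                           # toy training loop
        inputs = torch.randn(32, 1024, device=f"cuda:{rank}")
        loss = ddp_model(inputs).sum()
        loss.backward()                           # gradients all-reduced here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)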
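A combined FullyShardedDataParallel (FSDP) sketch covering auto wrapping, CPU offload, backward prefetch, and mixed precision control. It assumes a launch with torchrun (one process per GPU, LOCAL_RANK set); the wrapping threshold and toy model are illustrative only.

import functools
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Assumed launch: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # toy model

fsdp_model = FSDP(
    model,
    # Auto wrapping: every submodule with >= 1M parameters becomes its own
    # FSDP unit, so parameters are gathered and freed one unit at a time.
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
    # CPU offload keeps the sharded parameters (and gradients) in host memory.
    cpu_offload=CPUOffload(offload_params=True),
    # Prefetch the next unit's all-gather during backward to overlap communication.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    # Compute and gradient reduction run in fp16; the sharded parameters stay fp32.
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
    device_id=local_rank,
)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()
dist.destroy_process_group()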
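An activation checkpointing sketch using torch.utils.checkpoint: activations inside each wrapped block are recomputed during backward instead of being stored, trading extra compute for lower peak memory. The block layout is a stand-in, not the model used in the experiments.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class CheckpointedModel(nn.Module):
    def __init__(self, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([Block() for _ in range(num_blocks)])

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations in backward instead of storing them.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedModel()
out = model(torch.randn(8, 2048, requires_grad=True))
out.sum().backward()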
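A PyTorch Profiler sketch for the single-node workflow: the profiler follows a wait/warmup/active schedule and writes a trace viewable in TensorBoard. In a distributed job the same context manager wraps each rank's training loop; the toy model and step count are placeholders.

import torch
import torch.nn as nn
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

model = nn.Linear(1024, 1024)
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model = model.cuda()
    activities.append(ProfilerActivity.CUDA)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
device = next(model.parameters()).device

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(6):                     # enough steps to cover the schedule
        x = torch.randn(32, 1024, device=device)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()                        # advance the profiler schedule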
Taught by
Microsoft Developer
Related Courses
Introduction to Artificial Intelligence - Stanford University via Udacity
Natural Language Processing - Columbia University via Coursera
Probabilistic Graphical Models 1: Representation - Stanford University via Coursera
Computer Vision: The Fundamentals - University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course) - California Institute of Technology via Independent