Best Practices for Productionizing Distributed Training with Ray Train
Offered By: Anyscale via YouTube
Course Description
Overview
Learn best practices for productionizing distributed training with Ray Train in this 30-minute tutorial from Anyscale. Explore techniques for enabling fault tolerance in large-scale machine learning workloads, including experiment restoration, recovery from node failures, using persistent cloud storage for experiment state snapshots, and performing large model checkpointing. Discover simple additions to incorporate into Ray Train applications to leverage the benefits of fault-tolerant model training. Gain insights into handling issues like out-of-memory errors and storage failures in multi-node distributed training environments, particularly relevant for training large language models. Understand how fault tolerance can help reduce costs through the use of spot instances while preserving training progress in case of failures.
Syllabus
Best Practices for Productionizing Distributed Training with Ray Train
Taught by
Anyscale
Related Courses
MongoDB for DBAsMongoDB University MongoDB Advanced Deployment and Operations
MongoDB University Building Cloud Apps with Microsoft Azure - Part 3
Microsoft via edX Implementing Microsoft Windows Server Disks and Volumes
Microsoft via edX Cloud Computing and Distributed Systems
Indian Institute of Technology Patna via Swayam