YoVDO

Best Practices for Productionizing Distributed Training with Ray Train

Offered By: Anyscale via YouTube

Tags

Distributed Training, Machine Learning, Fault Tolerance, Cloud Storage

Course Description

Overview

Learn best practices for productionizing distributed training with Ray Train in this 30-minute tutorial from Anyscale. Explore techniques for enabling fault tolerance in large-scale machine learning workloads, including experiment restoration, recovery from node failures, using persistent cloud storage for experiment state snapshots, and performing large model checkpointing. Discover simple additions to incorporate into Ray Train applications to leverage the benefits of fault-tolerant model training. Gain insights into handling issues like out-of-memory errors and storage failures in multi-node distributed training environments, particularly relevant for training large language models. Understand how fault tolerance can help reduce costs through the use of spot instances while preserving training progress in case of failures.
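The "simple additions" described above can be sketched roughly as follows, assuming Ray 2.x's `ray.train` API. The bucket path, run name, and epoch-file layout are illustrative placeholders, not details from the talk; actually fitting the trainer requires a running Ray cluster.

```python
# Hypothetical sketch of fault-tolerant Ray Train setup, assuming Ray 2.x.
# Paths and names below are placeholders, not values from the talk.
import os
import tempfile

import ray.train
import ray.train.torch


def train_func(config):
    # Resume from the latest snapshot if this worker was restarted after a failure.
    start_epoch = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            with open(os.path.join(ckpt_dir, "epoch.txt")) as f:
                start_epoch = int(f.read()) + 1

    for epoch in range(start_epoch, config["num_epochs"]):
        ...  # one training epoch
        # Snapshot state each epoch so progress survives node loss.
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "epoch.txt"), "w") as f:
                f.write(str(epoch))
            ray.train.report(
                {"epoch": epoch},
                checkpoint=ray.train.Checkpoint.from_directory(tmp),
            )


trainer = ray.train.torch.TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 10},
    scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
    run_config=ray.train.RunConfig(
        # Persistent cloud storage for experiment state snapshots.
        storage_path="s3://my-bucket/ray-train-experiments",  # placeholder
        name="fault_tolerant_run",
        # Automatically retry the run after up to 3 worker/node failures,
        # e.g. spot-instance preemptions.
        failure_config=ray.train.FailureConfig(max_failures=3),
        checkpoint_config=ray.train.CheckpointConfig(num_to_keep=2),
    ),
)
# trainer.fit()  # requires a running Ray cluster
```

With `storage_path` pointing at cloud storage and a nonzero `max_failures`, a preempted spot node triggers an automatic restart that restores the latest checkpoint rather than losing training progress.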

Syllabus

Best Practices for Productionizing Distributed Training with Ray Train


Taught by

Anyscale
