Large Scale Distributed Deep Learning on Kubernetes Clusters

Offered By: Linux Foundation via YouTube

Tags

Kubernetes Courses, TensorFlow Courses, PyTorch Courses, Scalability Courses, Orchestration Courses, Distributed Deep Learning Courses, Horovod Courses

Course Description

Overview

Explore large-scale distributed deep learning deployments on Kubernetes clusters in this conference talk. Delve into the use of operators for managing and automating machine learning training, comparing the open-source tf-operator and mpi-operator. Examine different distribution strategies and their impact on performance, particularly CPU, GPU, and network utilization. Gain insights into optimizing orchestration for deep learning workloads, which are both network- and GPU-intensive, to improve cluster economics and avoid idle compute capacity. Learn from shared experience and best practices covering the TensorFlow 2.0 workflow, parameter servers, Kubernetes operators, TensorFlow's MirroredStrategy, and Horovod integrations for both TensorFlow and PyTorch.
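As a rough illustration of the synchronous data-parallel approach the talk contrasts with parameter servers, here is a minimal sketch of TensorFlow's MirroredStrategy, which replicates a model across the GPUs of one machine and all-reduces gradients every step. The model, optimizer, and dataset here are illustrative placeholders, not taken from the talk.

```python
import tensorflow as tf

# MirroredStrategy: synchronous data parallelism across the GPUs
# visible to this single process; gradients are all-reduced each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables created under the scope are mirrored onto every replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Keras splits each global batch across the replicas automatically:
# model.fit(train_dataset, epochs=5)
```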

Syllabus

Intro
Speakers
TensorFlow 2.0 Workflow
Orchestration for DL
Parameter Server
Reduce
Kubernetes Operators
MirroredStrategy in TF
TensorFlow + Horovod (see the sketch after this syllabus)
PyTorch + Horovod
Recall: TFJob vs. MPIJob
Shared API and Best Practices
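To make the "TensorFlow + Horovod" syllabus item concrete, here is a minimal sketch of the usual Horovod Keras setup: one process per GPU, launched with horovodrun or, on Kubernetes, by the mpi-operator as an MPIJob. The model architecture, base learning rate, and dataset are assumptions for illustration, not details from the talk.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# One process per GPU; horovodrun (or an MPIJob's launcher via mpirun)
# starts these processes across the cluster.
hvd.init()

# Pin each process to its local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the worker count (a common heuristic), then
# wrap the optimizer so gradients are ring-all-reduced each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

callbacks = [
    # Start all workers from identical weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Log only on rank 0 to keep output readable:
# model.fit(train_dataset, epochs=5, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```

Running this same script under the tf-operator (TFJob) versus the mpi-operator (MPIJob) is, in essence, the comparison the talk's "Recall: TFJob vs. MPIJob" section revisits.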


Taught by

Linux Foundation

Related Courses

Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018
Open Data Science via YouTube
Efficient Distributed Deep Learning Using MXNet
Simons Institute via YouTube
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters
Databricks via YouTube
SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training
USENIX via YouTube
Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
USENIX via YouTube