YoVDO

Distributed Deep Learning on Apache Mesos with GPUs and Gang Scheduling

Offered By: Linux Foundation via YouTube

Tags

Apache Mesos Courses
TensorFlow Courses
GPU Computing Courses
Scalability Courses
Cluster Management Courses
Distributed Deep Learning Courses
Horovod Courses

Course Description

Overview

Explore distributed deep learning on Apache Mesos with GPU support and gang scheduling in this 37-minute conference talk from Uber engineers. Learn how to speed up complex model training, scale to hundreds of GPUs, and shard models that don't fit on a single machine. Discover the design and implementation of distributed TensorFlow on Mesos clusters with hundreds of GPUs, built on key Mesos features such as GPU isolation and nested containers. Gain insights into GPU and gang scheduling, task discovery, and dynamic port allocation. See real-world examples of distributed training speed-ups using a TensorFlow model for image classification. Delve into Uber's deep learning applications in self-driving vehicles, trip forecasting, and fraud detection. Understand the architecture of Peloton, Uber's cluster management system, and its features for elastic GPU resource management, resource pools, and placement strategies. Compare the distributed TensorFlow and Horovod architectures on Mesos, and examine their performance benefits for large-scale deep learning tasks.
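
As a rough illustration of the gang scheduling idea mentioned above, the following minimal Python sketch places all tasks of a training job together or places none of them. It is not Peloton's or Mesos's actual implementation: the function name `gang_place`, the host names, and the first-fit strategy are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def gang_place(
    free_gpus: Dict[str, int], num_tasks: int, gpus_per_task: int
) -> Optional[Dict[int, str]]:
    """Hypothetical all-or-nothing placement: return a task -> host map
    only if every task in the gang fits; otherwise return None."""
    # Work on a copy so a failed attempt leaves the caller's view untouched.
    remaining = dict(free_gpus)
    placement: Dict[int, str] = {}
    for task_id in range(num_tasks):
        # First-fit: pick the first host with enough free GPUs for this task.
        host = next(
            (h for h, free in remaining.items() if free >= gpus_per_task), None
        )
        if host is None:
            # One task cannot be placed, so the whole gang is rejected.
            return None
        remaining[host] -= gpus_per_task
        placement[task_id] = host
    return placement


# Illustrative cluster state (assumed values, not from the talk).
cluster = {"host-a": 4, "host-b": 2}
print(gang_place(cluster, num_tasks=3, gpus_per_task=2))
print(gang_place(cluster, num_tasks=4, gpus_per_task=2))
```

The point of the all-or-nothing check is that a distributed training job cannot make progress with only some of its workers running, so partially placing a job would hold GPUs idle while it waits for the rest.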

Syllabus

Intro
Deep Learning @ UBER
Self-Driving Vehicles
Trip Forecasting
Fraud Detection
Why Distributed Deep Learning?
How Distributed Deep Learning Works
Why Mesos?
Mesos Support for GPUs
Mesos Nested Containers
What is Missing?
Peloton Overview
Peloton Architecture
Elastic GPU Resource Management
Resource Pools
Gang Scheduling
Placement Strategies
Why TensorFlow?
Architecture for Distributed TensorFlow on Mesos
Can We Do Better?
Architecture for Horovod on Mesos
Distributed Training Performance with Horovod
What About Usability?
Giving Back
Thank you!


Taught by

Linux Foundation

Related Courses

Challenges and Opportunities in Applying Machine Learning - Alex Jaimes - ODSC East 2018
Open Data Science via YouTube
Efficient Distributed Deep Learning Using MXNet
Simons Institute via YouTube
Benchmarks and How-Tos for Convolutional Neural Networks on HorovodRunner-Enabled Apache Spark Clusters
Databricks via YouTube
SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training
USENIX via YouTube
Alpa - Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
USENIX via YouTube