Large Scale Distributed Deep Learning on Kubernetes Clusters
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore large-scale distributed deep learning deployments on Kubernetes clusters in this conference talk. Delve into the use of operators for managing and automating machine learning training, comparing the open-source tf-operator and mpi-operator. Examine different distribution strategies and their impact on performance, particularly CPU, GPU, and network utilization. Gain insights into optimizing orchestration for deep learning workloads, which are both network- and GPU-intensive, to improve economics and avoid idle compute capacity. Learn from shared experiences and best practices covering the TensorFlow 2.0 workflow, parameter servers, Kubernetes operators, TensorFlow's MirroredStrategy, and Horovod integrations for both TensorFlow and PyTorch.
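As a rough illustration of the synchronous data-parallel approach the talk refers to as the mirror strategy, a minimal TensorFlow 2.x MirroredStrategy setup might look like the sketch below. The model, layer sizes, and commented-out training call are placeholders, not code from the talk.

import tensorflow as tf

# Synchronous data parallelism on a single worker: MirroredStrategy places
# one model replica on each visible GPU and all-reduces gradients per step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; any Keras model built inside the scope is replicated.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# The global batch is split evenly across replicas at each step, e.g.:
# model.fit(train_dataset, epochs=5)   # train_dataset is a hypothetical tf.data pipeline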
Syllabus
Intro
Speakers
TensorFlow 2.0 Workflow
Orchestration for DL
Parameter Server
Reduce
Kubernetes Operators
Mirror Strategy in TF
TensorFlow + Horovod
PyTorch + Horovod
Recall: TFJob vs. MPIJob
Shared API and Best Practices
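The syllabus pairs Horovod with both TensorFlow and PyTorch and contrasts TFJob with MPIJob launches. A minimal sketch of the kind of Horovod training script an MPIJob might launch is shown below, assuming one MPI rank per GPU; the model, learning rate, and commented-out training call are illustrative placeholders rather than material from the talk.

import horovod.tensorflow.keras as hvd
import tensorflow as tf

# Every rank runs this same script; the mpi-operator starts the ranks
# across the cluster as an MPIJob.
hvd.init()

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by world size and wrap the optimizer so gradients
# are averaged with all-reduce across workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

callbacks = [
    # Broadcast initial weights from rank 0 so all replicas start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(train_dataset, epochs=5, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)   # train_dataset is hypothetical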
Taught by
Linux Foundation
Related Courses
Automated Background Removal Using PyTorch - Wehkamp's Image Processing Pipeline (Databricks via YouTube)
Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks (USENIX via YouTube)
Machine Learning Using Kubeflow and Kubernetes (Devoxx via YouTube)
Building Deep Learning Models on Databricks (Pluralsight)
Democratizing Deep Learning at Scale with Horovod (Linux Foundation via YouTube)