Spotify's Approach to Distributed LLM Training with Ray on GKE

Offered By: Anyscale via YouTube

Tags

Kubernetes, Distributed Training

Course Description

Overview

Explore Spotify's approach to distributed Large Language Model (LLM) training in this Ray Summit 2024 breakout session. Discover how Spotify meets Generative AI demands by building an ML platform with Ray on Google Kubernetes Engine (GKE). Learn how they added LLM support for training models exceeding 70B parameters, manage diverse machine types including NVIDIA H100 GPUs, and allocate resources through Kubernetes. Gain insight into performance optimizations such as compact placement and NCCL Fast Socket, and see how Ray distributes training applications across GKE-managed resources. The session offers practical guidance for organizations looking to build or improve cloud-based LLM training with Ray and Kubernetes.

Syllabus

Spotify's Approach to Distributed LLM Training with Ray on GKE | Ray Summit 2024


Taught by

Anyscale

Related Courses

Custom and Distributed Training with TensorFlow
DeepLearning.AI via Coursera
Architecting Production-ready ML Models Using Google Cloud ML Engine
Pluralsight
Building End-to-end Machine Learning Workflows with Kubeflow
Pluralsight
Deploying PyTorch Models in Production: PyTorch Playbook
Pluralsight
Inside TensorFlow
TensorFlow via YouTube