YoVDO

How We Power the Largest AI Deployments on the Planet - Running Virtual Clusters at Scale

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses, GPU Computing Courses, Scalability Courses, Cluster Management Courses, Serverless Computing Courses, Autoscaling Courses

Course Description

Overview

Explore the challenges and solutions of running large-scale Kubernetes clusters for AI deployments in this 25-minute conference talk. Dive into CoreWeave's experience managing over 3,000 Kubernetes clusters on 5,000 bare metal nodes with massive GPU resources to power modern AI applications. Learn about the partnership between CoreWeave and Loft Labs, and discover how they overcame obstacles in security, GPU provisioning, and scalability. Gain insights into the pitfalls, design choices, and architectural challenges faced over three years while developing a serverless Kubernetes offering. Topics covered include secure tenant isolation on shared infrastructure, achieving 10-second autoscaling, on-demand cluster and compute provisioning, and day-2 operations for managing a large fleet of clusters at scale.

Syllabus

How We Power the Largest AI Deployments on the Planet: Running Vir... Brandon Jacobs & Lukas Gentele


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Financial Sustainability: The Numbers side of Social Enterprise
+Acumen via NovoEd
Cloud Computing Concepts: Part 2
University of Illinois at Urbana-Champaign via Coursera
Developing Repeatable Models® to Scale Your Impact
+Acumen via Independent
Managing Microsoft Windows Server Active Directory Domain Services
Microsoft via edX
Introduction aux conteneurs
Microsoft Virtual Academy via OpenClassrooms