YoVDO

Monitoring GPUs at Scale for AI/ML and HPC Clusters

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Conference Talks Courses, Artificial Intelligence Courses, Machine Learning Courses, Capacity Planning Courses, High Performance Computing Courses

Course Description

Overview

Explore a comprehensive conference talk on monitoring GPU clusters for AI/ML and HPC workloads at scale. Learn how NVIDIA addresses the monitoring needs of various user personas, including AI/ML researchers, operations teams, and stakeholders. Discover the combination of open-source tools used to meet diverse requirements and gain insights into deployment, maintenance, security, and scalability challenges encountered when monitoring GPU data. Understand how NVIDIA overcame these obstacles to create an effective monitoring solution for large GPU Kubernetes clusters running deep learning training workloads.

Syllabus

Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Introduction to Artificial Intelligence
Stanford University via Udacity
Natural Language Processing
Columbia University via Coursera
Probabilistic Graphical Models 1: Representation
Stanford University via Coursera
Computer Vision: The Fundamentals
University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course)
California Institute of Technology via Independent