Monitoring GPUs at Scale for AI/ML and HPC Clusters
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore a comprehensive conference talk on monitoring GPU clusters for AI/ML and HPC workloads at scale. Learn how NVIDIA addresses the monitoring needs of various user personas, including AI/ML researchers, operations teams, and stakeholders. Discover the combination of open-source tools used to meet diverse requirements and gain insights into deployment, maintenance, security, and scalability challenges encountered when monitoring GPU data. Understand how NVIDIA overcame these obstacles to create an effective monitoring solution for large GPU Kubernetes clusters running deep learning training workloads.
Syllabus
Monitoring GPUs at Scale for AI/ML and HPC Clusters - Bharti L Agrawal, NVIDIA
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Building Geospatial Apps on Postgres, PostGIS, & Citus at Large Scale
Microsoft via YouTube
Unlocking the Power of ML for Your JavaScript Applications with TensorFlow.js
TensorFlow via YouTube
Managing the Reactive World with RxJava - Jake Wharton
ChariotSolutions via YouTube
What's New in Grails 2.0
ChariotSolutions via YouTube
Performance Analysis of Apache Spark and Presto in Cloud Environments
Databricks via YouTube