Keep HPC Running - an SRE's Guide to Supporting GPUs on Kubernetes

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses, Machine Learning Courses, Prometheus Courses, High Performance Computing Courses, GPU Computing Courses, Telemetry Courses, Observability Courses

Course Description

Overview

Explore best practices for Site Reliability Engineers (SREs) supporting GPU-enabled Kubernetes clusters for High-Performance Computing (HPC) workloads in this informative conference talk. Delve into the unique challenges of operating GPU-equipped nodes, focusing on telemetry, observability, and criteria for human intervention. Learn about essential metrics SRE teams should incorporate into their operational tools to effectively support HPC and AI use cases, including Generative Pre-trained Transformers (GPTs), machine learning, and quantitative modeling. Examine a working example of custom plugin monitors for the Kubernetes node-problem-detector daemon, utilizing NVIDIA's open-source DCGM and NVML bindings. Gain insights into the operational importance of metrics exposed by NVIDIA's DCGM-Exporter to Prometheus for maintaining cluster and workload health. Enhance your ability to keep HPC running smoothly on Kubernetes with GPU support.
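
To give a feel for the kind of check the talk describes, here is a minimal sketch of a health check that could back a node-problem-detector custom plugin. It is not the speaker's implementation. It assumes the check receives DCGM-Exporter output in the Prometheus text format and looks only at two DCGM-Exporter metrics, DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_XID_ERRORS. The 85 C threshold and the function names parse_samples and check_gpus are hypothetical, chosen for illustration. The return codes follow the custom plugin convention of 0 for OK, 1 for a detected problem, and any other value for unknown.

```python
import re

# Exit codes as interpreted by node-problem-detector custom plugins.
OK, NON_OK, UNKNOWN = 0, 1, 2

# Hypothetical threshold, for illustration only; tune per GPU model.
MAX_TEMP_C = 85.0

# Matches one sample line: metric name, optional {labels}, then a value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)


def parse_samples(text):
    """Yield (metric_name, gpu_label, value) from Prometheus text format."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue
        gpu = re.search(r'gpu="([^"]*)"', match.group("labels") or "")
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        yield match.group("name"), (gpu.group(1) if gpu else "?"), value


def check_gpus(text, max_temp_c=MAX_TEMP_C):
    """Return (exit_code, message) summarizing GPU health in the scrape."""
    problems = []
    seen = False
    for name, gpu, value in parse_samples(text):
        if name == "DCGM_FI_DEV_GPU_TEMP":
            seen = True
            if value > max_temp_c:
                problems.append(
                    f"GPU {gpu} temperature {value:.0f}C exceeds {max_temp_c:.0f}C"
                )
        elif name == "DCGM_FI_DEV_XID_ERRORS":
            seen = True
            if value != 0:
                problems.append(f"GPU {gpu} reported XID error {value:.0f}")
    if not seen:
        return UNKNOWN, "no DCGM metrics found"
    if problems:
        return NON_OK, "; ".join(problems)
    return OK, "all GPUs healthy"
```

A wrapper script would print the returned message to standard output and exit with the returned code, so that node-problem-detector can turn a non-zero result into a node condition or event. How the scrape text is obtained (for example, from the exporter's metrics endpoint) is left out here and depends on the cluster setup.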

Syllabus

Keep HPC Running - an SRE's Guide to Supporting GPUs on Kubernetes - Christopher Dutra, JP Morgan


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Introduction to Artificial Intelligence
Stanford University via Udacity
Natural Language Processing
Columbia University via Coursera
Probabilistic Graphical Models 1: Representation
Stanford University via Coursera
Computer Vision: The Fundamentals
University of California, Berkeley via Coursera
Learning from Data (Introductory Machine Learning course)
California Institute of Technology via Independent