Keep HPC Running - SRE's Guide to Supporting GPUs on Kubernetes
Offered By: CNCF [Cloud Native Computing Foundation] via YouTube
Course Description
Overview
Explore best practices for Site Reliability Engineers (SREs) supporting GPU-enabled Kubernetes clusters for High-Performance Computing (HPC) workloads in this informative conference talk. Delve into the unique challenges of operating GPU-equipped nodes, focusing on telemetry, observability, and criteria for human intervention. Learn about essential metrics SRE teams should incorporate into their operational tools to effectively support HPC and AI use cases, including Generative Pre-trained Transformers (GPTs), machine learning, and quantitative modeling. Examine a working example of custom plugin monitors for the Kubernetes node-problem-detector daemon, utilizing NVIDIA's open-source DCGM and NVML bindings. Gain insights into the operational importance of metrics exposed by NVIDIA's DCGM-Exporter to Prometheus for maintaining cluster and workload health. Enhance your ability to keep HPC running smoothly on Kubernetes with GPU support.
Syllabus
Keep HPC Running - an SRE's Guide to Supporting GPUs on Kubernetes - Christopher Dutra, JP Morgan
Taught by
CNCF [Cloud Native Computing Foundation]
Related Courses
Моделирование биологических молекул на GPU (Biomolecular modeling on GPU)Moscow Institute of Physics and Technology via Coursera Practical Deep Learning For Coders
fast.ai via Independent GPU Architectures And Programming
Indian Institute of Technology, Kharagpur via Swayam Perform Real-Time Object Detection with YOLOv3
Coursera Project Network via Coursera Getting Started with PyTorch
Coursera Project Network via Coursera