Breaking Boundaries - TACC as a Unified Cloud-Native Infrastructure for AI and HPC
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore a conference talk on TACC (Turing AI Computing Cloud), an AI infrastructure management solution that bridges the gap between Kubernetes and Slurm setups. Discover how TACC addresses the challenges of managing large GPU clusters for AI models by offering a unified cloud-native infrastructure for both AI and High-Performance Computing (HPC). Learn about the five-year journey of implementing TACC at the Hong Kong University of Science and Technology, where it has supported over 500 active researchers since 2020. Delve into key aspects such as user experience improvements, resource management strategies, and performance enhancements. Gain insights into the UI for job submissions, multi-tenant allocation using CNCF HAMi and Kueue, and a distributed infrastructure featuring networked storage and RDMA via CNCF Spiderpool and Fluid. Understand how TACC combines the advantages of Kubernetes and Slurm to create a more efficient and user-friendly environment for AI and HPC workloads.
Syllabus
Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC - Peter Pan & Kaiqiang Xu
Taught by
Linux Foundation
Related Courses
Linux for Scientific Computing Masterclass - 10.5 Hours (Udemy)
Reduce Time to Market and Capital Spend Using Software Operators in HPC (Ubuntu OnAir via YouTube)
HPC with Containers on Ubuntu - Enroot and Pyxis Implementation (Ubuntu OnAir via YouTube)
PyKubeSlurm - A Python Operator for Efficient Job Scheduling in Slurm Using Kubernetes (Ubuntu OnAir via YouTube)
Running Plain Kubernetes Pods on SLURM - April 17, 2024 (CNCF [Cloud Native Computing Foundation] via YouTube)