Breaking Boundaries - TACC as a Unified Cloud-Native Infrastructure for AI and HPC

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
Artificial Intelligence Courses
High Performance Computing Courses
Slurm Courses
Multi-Tenancy Courses

Course Description

Overview

Explore TACC (Turing AI Computing Cloud), a solution for managing AI infrastructure, in this conference talk by Peter Pan from DaoCloud and Kaiqiang Xu from the Hong Kong University of Science and Technology (HKUST). Discover how TACC bridges the gap between Kubernetes and Slurm-based HPC setups, offering a unified cloud-native infrastructure for AI and high-performance computing. Learn about the five-year journey of running TACC at HKUST, where it has supported over 500 active researchers since 2020. Gain insights into the user interface for job submissions, the multi-tenant resource allocation strategies built on HAMi and Kueue, and the distributed infrastructure that provides networked storage and RDMA through Spiderpool and Fluid. Understand how TACC addresses the challenges of managing large-scale GPU clusters for AI models while giving AI researchers finer management granularity, better stability, and improved usability.

Syllabus

Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC - Peter Pan & Kaiqiang Xu


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Linux for Scientific Computing Masterclass - 10.5 Hours
Udemy
Reduce Time to Market and Capital Spend Using Software Operators in HPC
Ubuntu OnAir via YouTube
HPC with Containers on Ubuntu - Enroot and Pyxis Implementation
Ubuntu OnAir via YouTube
PyKubeSlurm - A Python Operator for Efficient Job Scheduling in Slurm Using Kubernetes
Ubuntu OnAir via YouTube
Running Plain Kubernetes Pods on SLURM - April 17, 2024
CNCF [Cloud Native Computing Foundation] via YouTube