YoVDO

The Day We Deleted Production - Kubernetes Infrastructure Recovery at CERN

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses, Disaster Recovery Courses, Scientific Computing Courses, High Availability Courses, Infrastructure Management Courses, Incident Management Courses, GitOps Courses

Course Description

Overview

Explore a critical incident at CERN in which a maintenance tool accidentally deleted a third of production capacity within minutes. Learn how CERN's Kubernetes infrastructure, which runs workloads ranging from scientific computing to critical services for the campus and the physics accelerator complex, avoided downtime and recovered quickly. Discover the architecture behind high service availability, strategies to reduce blast radius, the concept of "clusters as cattle," and the crucial role GitOps played in saving the day. Gain insights into lessons learned, including cyclic dependencies during recovery from a major outage and considerations for stateful workloads and multi-cluster scheduling. Watch a live demonstration of CERN services recovering from what would have been a severe event in the past, and understand how years of effort have led to calm user responses during major incidents.

Syllabus

The Day We Delete(d) Production - Ricardo Rocha & Spyridon Trigazis, CERN


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Scientific Computing
University of Washington via Coursera
Biology Meets Programming: Bioinformatics for Beginners
University of California, San Diego via Coursera
High Performance Scientific Computing
University of Washington via Coursera
Practical Numerical Methods with Python
George Washington University via Independent
Julia Scientific Programming
University of Cape Town via Coursera