YoVDO

The Day We Deleted Production - Kubernetes Infrastructure Recovery at CERN

Offered By: CNCF [Cloud Native Computing Foundation] via YouTube

Tags

Kubernetes Courses
Disaster Recovery Courses
Scientific Computing Courses
High Availability Courses
Infrastructure Management Courses
Incident Management Courses
GitOps Courses

Course Description

Overview

Explore a critical incident at CERN in which a maintenance tool accidentally deleted a third of production capacity within minutes. Learn how CERN's Kubernetes infrastructure, which runs workloads ranging from scientific computing to critical services for the campus and the physics accelerator complex, avoided downtime and recovered quickly. Discover the architecture behind high service availability, strategies for reducing blast radius, the concept of "clusters as cattle," and the crucial role GitOps played in saving the day. Gain insights into the lessons learned, including cyclic dependencies during recovery from a major outage and considerations for stateful workloads and multi-cluster scheduling. Watch a live demonstration of CERN services recovering from what would once have been a severe event, and see how years of effort have led to users responding calmly during major incidents.

Syllabus

The Day We Delete(d) Production - Ricardo Rocha & Spyridon Trigazis, CERN


Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Emergency Management
Open2Study
Resilience in Children Exposed to Trauma, Disaster and War: Global Perspectives
University of Minnesota via Coursera
MongoDB Advanced Deployment and Operations
MongoDB University
Arch403: Designing Resilient Schools
Build Academy via EdCast
Bases de données relationnelles : Comprendre pour maîtriser (Relational Databases: Understanding for Mastery)
Inria (French Institute for Research in Computer Science and Automation) via France Université Numérique