How to Not Destroy Your Production Kubernetes Clusters

Offered By: USENIX via YouTube

Course Description

Overview

Explore real-world production incident stories from managing hundreds of Kubernetes clusters, with a focus on clusters scaling to 10K+ nodes. Learn how seemingly simple operations, such as adding a single node or modifying a ConfigMap, can trigger chain reactions that disrupt entire clusters. Discover best practices for maintaining high cluster availability through lessons learned from failures involving postmodern databases, automation, user escalation, and paradoxical finalizers. Gain insights into mitigating paging storms, handling manual operations, and improving monitoring dashboards. Understand why security context changes matter, and review the key takeaways for managing large-scale Kubernetes environments effectively.

Syllabus

Intro
Background
Postmodern Database
Automation
User escalation
Initial investigation
Restoring service objects
Collecting service definitions
The impact of the incident
The reason for the failure
Fixing the webhooks
Why the operator went rogue
Kubernetes label selector package
Test engineer accidentally created app load balancer
What can we learn
Paradoxical Finalizer
Paging Storm
Mitigation
Kubernetes Platform
Manual Operations
Lessons Learned
User Complaints
Monitoring Dashboard
Victim Cluster
Security Context Change
Learnings
Recap
Key takeaways

Taught by

USENIX
