How to Not Destroy Your Production Kubernetes Clusters
Offered By: USENIX via YouTube
Course Description
Overview
Explore real-world production incident stories from managing hundreds of Kubernetes clusters, with a focus on clusters scaling to 10K+ nodes. Learn how seemingly simple operations like adding a single node or modifying a ConfigMap can trigger chain reactions that disrupt entire clusters. Discover best practices for maintaining high cluster availability through lessons learned from failures involving postmodern databases, automation, user escalation, and paradoxical finalizers. Gain insights into mitigating paging storms, handling manual operations, and improving monitoring dashboards. Understand the importance of security context changes and key takeaways for effectively managing large-scale Kubernetes environments.
Syllabus
Intro
Background
Postmodern Database
Automation
User escalation
Initial investigation
Restoring service objects
Collecting service definitions
The impact of the incident
The reason for the failure
Fixing the webhooks
Why the operator went rogue
Kubernetes label selector package
Test engineer accidentally created app load balancer
What can we learn
Paradoxical Finalizer
Paging Storm
Mitigation
Kubernetes Platform
Manual Operations
Lessons Learned
User Complaints
Monitoring Dashboard
Victim Cluster
Security Context Change
Learnings
Recap
Key takeaways
Taught by
USENIX
Related Courses
Emergency Management: Risk, Incidents and Leadership (Coventry University via FutureLearn)
Security Operations (Coventry University via FutureLearn)
Planificación y Coordinación en Logística Humanitaria (Acción contra el Hambre via Miríadax)
Preparing for Google Cloud Certification: Cloud DevOps Engineer (Google Cloud via Coursera)
Managing Cybersecurity (University System of Georgia via Coursera)