How to Not Destroy Your Production Kubernetes Clusters
Offered By: USENIX via YouTube
Course Description
Overview
Explore real-world production incident stories from managing hundreds of Kubernetes clusters, with a focus on clusters scaling to 10K+ nodes. Learn how seemingly simple operations, such as adding a single node or modifying a ConfigMap, can trigger chain reactions that disrupt entire clusters. Discover best practices for maintaining high cluster availability through lessons learned from failures involving postmodern databases, automation, user escalation, and paradoxical finalizers. Gain insights into mitigating paging storms, handling manual operations, and improving monitoring dashboards. Understand the importance of security context changes, and review the key takeaways for effectively managing large-scale Kubernetes environments.
Syllabus
Intro
Background
Postmodern Database
Automation
User escalation
Initial investigation
Restoring service objects
Collecting service definitions
The impact of the incident
The reason for the failure
Fixing the webhooks
Why the operator went rogue
Kubernetes label selector package
Test engineer accidentally created app load balancer
What can we learn
Paradoxical Finalizer
Paging Storm
Mitigation
Kubernetes Platform
Manual Operations
Lessons Learned
User Complaints
Monitoring Dashboard
Victim Cluster
Security Context Change
Learnings
Recap
Key takeaways
Taught by
USENIX
Related Courses
Introduction to Cloud Infrastructure Technologies (Linux Foundation via edX)
Scalable Microservices with Kubernetes (Google via Udacity)
Google Cloud Fundamentals: Core Infrastructure (Google via Coursera)
Introduction to Kubernetes (Linux Foundation via edX)
Fundamentals of Containers, Kubernetes, and Red Hat OpenShift (Red Hat via edX)