Canarying Well - Lessons Learned from Canarying Large Populations
Offered By: USENIX via YouTube
Course Description
Overview
Explore the intricacies of canarying in production environments through this insightful conference talk from SREcon18 Europe. Delve into common pitfalls, best practices, and a comprehensive end-to-end strategy for implementing effective canary processes. Learn from Google's Štěpán Davidovič as he shares valuable lessons on controlled rollouts to mitigate risks in large-scale systems. Gain a deeper understanding of canarying priorities, geographical distribution challenges, high variance scenarios, and bimodal distributions. Examine real-world examples involving service caches, memory leaks, and compound probabilities. Discover the importance of careful metric selection and analysis in ensuring successful canary deployments. Walk away with practical knowledge on implementing a robust three-step canary process to enhance the safety and reliability of your production changes.
Syllabus
Intro
Canarying: What is that?
What we're going to talk about
What we're not going to talk about
Conflicting Incentives
Triangle of Canarying Priorities
Example: Geographical distribution
Example: High variance among replicas
Example: Bimodal distribution
Example: Two metrics, different outliers
Takeaways 2
Example: Service With Cache, restarted
Example: Memory leak canary
Example: Before/after test
Example Takeaway
Example: Compound probability
Beware Meta Analysis
Prefer Few Metrics
Canary In These 3 Simple Steps
Canary In These 3-ish Simple Steps
Taught by
USENIX
Related Courses
How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube