Antics, Drift, and Chaos
Offered By: Strange Loop Conference via YouTube
Course Description
Syllabus
Antics, drift and chaos
Add a new test
Result: execution of unit test led to an outage
Moral: use unit tests sparingly, for they are dangerous
Complex systems exhibit unexpected behavior
System failure
Generalized Uncertainty Principle
Error handling
Latency increases
More clients retry
Support systems
Non-critical service failed
Lock shared by app threads
Lock contention
Memory leak bug in agent that monitors health of EBS servers
Mitigation
Command input entered incorrectly
Lorin's conjecture
Recap: Antics
Act II: Drift
Broken parts and sloppy devs
Drift into failure
Unruly technology
Software is hard to reason about
Scarcity and competition
Efficiency vs thoroughness
Decrementalism
Sensitive dependence on initial conditions
One day...
Traffic spike
Recap: Drift
Make the wrong thing harder
Chaos engineering
Find vulnerabilities before they become outages
External validity
Risk: vulnerable to failure of non-critical services
Build a hypothesis around steady state behavior
Vary real-world events
Fail RPC calls
Add latency to RPC calls
Run experiments in production
Route prod traffic to ChAP clusters
Automate experiments to run continuously
Integrate with deployment pipelines
Minimize blast radius
Route a small fraction of traffic
Takeaways
1. Systems behave pathologically
Chaos experiments can find pathologies
2. Reasonable human decisions can lead to dangerous states
Chaos provides incentives
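The chaos-experiment items above (a steady-state hypothesis, failing RPC calls, routing a small fraction of traffic) can be illustrated with a minimal sketch. This is not the talk's code or ChAP itself: the function names, the simulated service, the 1% traffic fraction, and the tolerance are all assumptions chosen for illustration.

```python
import random


def non_critical_rpc():
    """Stand-in for a call to a non-critical service (hypothetical)."""
    return ["recommended item"]


def handle_request(inject_failure):
    """Serve one request, falling back to an empty list if the RPC fails."""
    try:
        if inject_failure:
            # Simulated fault: the RPC to the non-critical service fails.
            raise ConnectionError("injected RPC failure")
        extras = non_critical_rpc()
    except ConnectionError:
        extras = []  # degrade gracefully instead of failing the request
    return {"status": 200, "extras": extras}


def run_experiment(n_requests=10_000, traffic_fraction=0.01, seed=0):
    """Route a small fraction of requests to the experiment group.

    Returns the success rate per group (None if a group saw no traffic).
    """
    rng = random.Random(seed)
    counts = {"control": [0, 0], "experiment": [0, 0]}  # [ok, total]
    for _ in range(n_requests):
        group = "experiment" if rng.random() < traffic_fraction else "control"
        response = handle_request(inject_failure=(group == "experiment"))
        counts[group][1] += 1
        if response["status"] == 200:
            counts[group][0] += 1
    return {
        group: (ok / total if total else None)
        for group, (ok, total) in counts.items()
    }


def hypothesis_holds(rates, tolerance=0.01):
    """Steady-state hypothesis: the experiment group's success rate stays
    within `tolerance` of the control group's."""
    if rates["control"] is None or rates["experiment"] is None:
        return False  # not enough data to conclude anything
    return abs(rates["control"] - rates["experiment"]) <= tolerance


if __name__ == "__main__":
    rates = run_experiment()
    print(rates, "hypothesis holds:", hypothesis_holds(rates))
```

Because only the experiment group sees the injected fault, the blast radius is bounded by `traffic_fraction`. If the fallback in `handle_request` were removed, the experiment group's success rate would drop and the hypothesis check would fail, exposing the vulnerability before it caused an outage.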
Taught by
Strange Loop Conference
Related Courses
DevOps Foundations: Chaos Engineering (LinkedIn Learning)
Practical Chaos Engineering - Breaking Things on Purpose to Make Them More Resilient Against Failure (NDC Conferences via YouTube)
Patterns for Resilient Architecture (NDC Conferences via YouTube)
Challenges of Starting an SRE Team from Scratch in an Enterprise (USENIX via YouTube)
The Smallest Possible SRE Team (USENIX via YouTube)