Are We Getting Better Yet? - Progress Toward Safer Operations
Offered By: USENIX via YouTube
Course Description
Overview
Syllabus
Complexity
Law of Stretched Systems
Prioritize a learn-and-adapt safety mode over a prevent-and-fix safety mode
Prevent & Fix
Learn & Adapt
Measuring progress
Metrics anchor the story and the story gives meaning to the metrics
Barriers and guardrails are used to prevent people from repeating mistakes
Performance variability
Ensure positive outcomes through activities like team practice and chaos experiments
Chaos experiments as scrimmage
Incidents are a source of insights
Service Level Objectives
Control vs Influence
Watch the inputs, influence the outputs
Opportunity vs Obligation
Judging human performance with metrics applies conclusions without context
Recording performance metrics promotes one perspective over others
Interview Debriefing
Ask deeper questions
How close to the safety boundary is the pod autoscaler pushing my infrastructure?
Are my cloud provider's staff team players in my sociotechnical system?
Recap
Taught by
USENIX
Related Courses
How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube