Are We Getting Better Yet? - Progress Toward Safer Operations
Offered By: USENIX via YouTube
Course Description
Overview
Syllabus
Complexity
Law of Stretched Systems
Prioritize a learn-and-adapt safety mode over a prevent-and-fix safety mode
Prevent & Fix
Learn & Adapt
Measuring progress
Metrics anchor the story and the story gives meaning to the metrics
Barriers and guardrails are used to prevent people from repeating mistakes
Performance variability
Ensure positive outcomes through activities like team practice and chaos experiments
Chaos experiments as scrimmage
Incidents are a source of insights
Service Level Objectives
Control vs Influence
Watch the inputs, influence the outputs
Opportunity vs Obligation
Judging human performance with metrics applies conclusions without context
Recording performance metrics promotes one perspective over others
Interview Debriefing
Ask deeper questions
How close to the safety boundary is the pod autoscaler pushing my infrastructure?
Are my cloud provider's staff team players in my sociotechnical system?
Recap
Taught by
USENIX
Related Courses
Developing a Google SRE Culture - 日本語版 (Google Cloud via Coursera)
Integrated safety, health and environmental management: An introduction (The Open University via OpenLearn)
Incident Detection and Response: The Big Picture (Pluralsight)
Threat Analysis (Cisco via Coursera)
Inside the Biggest Hacks and Facts of the Past Year - 2022-2023 (BruCON Security Conference via YouTube)