Are We Getting Better Yet? - Progress Toward Safer Operations

Offered By: USENIX via YouTube

Tags

SREcon Courses, Service-Level Objectives Courses, Incident Analysis Courses

Course Description

Overview

Explore strategies for improving operational safety and incident management in this SREcon20 Americas talk. Delve into the complexities of measuring progress beyond shallow metrics, understanding the subtle influences of incidents on organizations, and leveraging thorough incident analysis for deeper insights. Learn how to uncover unseen opportunities through meta-analyses across incidents, provide leaders with richer data for strategic decision-making, and foster trust between leadership and practitioners. Discover the importance of prioritizing a "learn and adapt" safety mode, conducting chaos experiments as practice, and considering performance variability. Examine the balance between control and influence, opportunity and obligation, and the significance of asking deeper questions during incident debriefings. Gain valuable insights into creating healthier, happier teams and advancing toward safer operations in complex systems.

Syllabus

Complexity
Law of Stretched Systems
Prioritize a learn and adapt safety mode over a prevent and fix safety mode
Prevent & Fix
Learn & Adapt
Measuring progress
Metrics anchor the story and the story gives meaning to the metrics
Barriers and guardrails are used to prevent people from repeating mistakes
Performance variability
Ensure positive outcomes through activities like team practice and chaos experiments
Chaos experiments as scrimmage
Incidents are a source of insights
Service Level Objectives
Control vs Influence
Watch the inputs, influence the outputs
Opportunity vs Obligation
Judging human performance with metrics applies conclusions without context
Recording performance metrics promotes one perspective over others
Interview Debriefing
Ask deeper questions
How close to the safety boundary is the pod autoscaler pushing my infrastructure?
Are my cloud provider's staff team players in my sociotechnical system?
Recap


Taught by

USENIX

Related Courses

Developing a Google SRE Culture - 日本語版
Google Cloud via Coursera
Integrated safety, health and environmental management: An introduction
The Open University via OpenLearn
Incident Detection and Response: The Big Picture
Pluralsight
Threat Analysis
Cisco via Coursera
Inside the Biggest Hacks and Facts of the Past Year - 2022-2023
BruCON Security Conference via YouTube