The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams

Offered By: USENIX via YouTube

Tags

SREcon Courses
Distributed Computing Courses
Incident Management Courses

Course Description

Overview

Explore the intricacies of incident response and coordination in remote SRE teams through this 48-minute conference talk from SREcon20 Americas. Delve into Dr. Laura Maguire's three-year research on engineering teams handling service outages, examining 62 cases across four organizations. Discover surprising findings that challenge existing domain models, including how incident management differs from Google SRE suggestions and how incident command can hinder fast resolution. Learn about the subtle choreography of cognitive work in fault management, the potential drawbacks of coordination tools, and strategies for adaptive choreography. Gain insights into how tooling and intra-organizational dependencies affect coordination costs across time and organizational boundaries, increasing complexity for SREs. Understand the challenges of coordinating multiple perspectives, dealing with backup issues, and managing hidden complexities in distributed computing environments.

Syllabus

Introduction
The Secret Lives of SREs
Coordinate Multiple Diverse Perspectives
Backup Issues
Hidden Complexity
Outlier Event
Sarah
Sarah's Knowledge
Incident Response
Incident Command
Speed Bumps
Distributed Computing
Conclusion


Taught by

USENIX

Related Courses

How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube