The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams
Offered By: USENIX via YouTube
Course Description
Overview
Explore the intricacies of incident response and coordination in remote SRE teams through this 48-minute conference talk from SREcon20 Americas. Delve into Dr. Laura Maguire's three-year research on engineering teams handling service outages, examining 62 cases across four organizations. Discover surprising findings that challenge existing domain models, including how incident management in practice differs from what Google's SRE guidance suggests and how incident command can hinder fast resolution. Learn about the subtle choreography of cognitive work in fault management, the potential drawbacks of coordination tools, and strategies for adaptive choreography. Gain insights into how tooling and intra-organizational dependencies raise coordination costs across time and organizational boundaries, increasing complexity for SREs. Understand the challenges of coordinating multiple perspectives, dealing with backup issues, and managing hidden complexities in distributed computing environments.
Syllabus
Introduction
The Secret Lives of SREs
Coordinate Multiple Diverse Perspectives
Backup Issues
Hidden Complexity
Outlier Event
Sarah
Sarah's Knowledge
Incident Response
Incident Command
Speed Bumps
Distributed Computing
Conclusion
Taught by
USENIX
Related Courses
Cloud Computing Concepts, Part 1 (University of Illinois at Urbana-Champaign via Coursera)
Cloud Computing Concepts: Part 2 (University of Illinois at Urbana-Champaign via Coursera)
Reliable Distributed Algorithms - Part 1 (KTH Royal Institute of Technology via edX)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)
Réalisez des calculs distribués sur des données massives [Perform Distributed Computations on Massive Data] (CentraleSupélec via OpenClassrooms)