Site Reliability Engineering at Google
Offered By: GOTO Conferences via YouTube
Course Description
Overview
Explore the world of Site Reliability Engineering (SRE) at Google in this 51-minute conference talk from GOTO Berlin 2017. Dive into the challenges faced by Google's experts in operating their vast tech infrastructure and products. Learn how SRE treats operations as a software problem, addressing the enormous scale, rapid growth, and complexity of Google's systems. Discover key concepts such as Error Budgets, the 50% cap on Ops work, and the importance of keeping developers in the rotation. Gain insights into minimizing damage during outages, practicing incident response through the "Wheel of Misfortune," and implementing a post-mortem philosophy to prevent recurrence. Understand the unique approach of SRE in balancing development and operations, staffing, and managing operational overload. This talk provides valuable knowledge for those interested in modern infrastructure management and reliability engineering at scale.
Syllabus
Intro
Reliability is easy to take for granted
What is Site Reliability Engineering (SRE)?
Part I: Dev and Ops
Is conflict inevitable?
Service Level Agreement (SLA)
What do you spend your budget on?
The rule
Two nice features of Error Budgets
Part II: Staffing, Work, Ops Overload
SRE hires only coders
50% cap on Ops work
Keep DEV in the rotation
Speaking of Dev and Ops work...
SRE Portability
Part III: Death, taxes, and outages...
Minimize Damage
A word on practice...
Wheel of Misfortune
Prevent recurrence
Post-mortem philosophy
Summary
O'Reilly Book
Taught by
GOTO Conferences
Related Courses
Startup Engineering - Stanford University via Coursera
Developing Scalable Apps in Java - Google via Udacity
Cloud Computing Concepts, Part 1 - University of Illinois at Urbana-Champaign via Coursera
Cloud Networking - University of Illinois at Urbana-Champaign via Coursera
Cloud Computing Concepts: Part 2 - University of Illinois at Urbana-Champaign via Coursera