
Confessions of a Systems Engineer - Learning from My 20+ Years of Failure

Offered By: USENIX via YouTube

Tags

SREcon Courses
Service Level Agreements Courses

Course Description

Overview

In this 39-minute SREcon session, David Argent of Amazon shares lessons from more than two decades of failures in designing and running large-scale online services. Key concepts include minimizing the impact of changes, monitoring thoroughly, automating mitigations, and designing services so that incidents can be resolved quickly. The session also covers exercising processes regularly, enforcing processes with technology, and understanding every scenario a service supports. Argent draws on a varied background that includes roles as Technical Writer, Systems Engineer, and Lead Problem Engineer at companies such as Microsoft and Amazon.

Syllabus

Intro
There Are No Safe Changes
Minimize the Blast Radius on Changes
Monitor Accurately and Measure Thoroughly
Automate Mitigations
Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One
Use Functional Gates Pre-, Post- and During Releases
Design to Meet SLAs and Mitigate Incidents Quickly
Regularly Exercise All Processes and Tools
Enforce Processes with Technology
Redirect or Drop Traffic Aggressively During Incidents
Production Quality Tools
Sanitize and Verify Inputs
Understand All of the Scenarios You Support
Transition Service Responsibilities Carefully


Taught by

USENIX

Related Courses

Cloud Computing Engineering and Management
University System of Maryland via edX
Customer Service Fundamentals
IBM via Coursera
Gathering Non-functional Requirements for Microsoft Azure
Pluralsight
CCSK Cert Prep: 1 Cloud Architecture
LinkedIn Learning
Exam Prep: Microsoft Azure Fundamentals (AZ-900)
LinkedIn Learning