Confessions of a Systems Engineer - Learning from My 20+ Years of Failure
Offered By: USENIX via YouTube
Course Description
Overview
Explore lessons from more than two decades of systems engineering experience in this 39-minute SREcon conversation with David Argent from Amazon. Learn from failures in designing and running large-scale online services. Key concepts include minimizing the impact of changes, monitoring thoroughly, automating mitigations, and designing for quick incident resolution. It also covers exercising processes regularly, enforcing processes with technology, and understanding every supported scenario. Argent draws on a varied background, with roles including Technical Writer, Systems Engineer, and Lead Problem Engineer at companies such as Microsoft and Amazon.
Syllabus
Intro
There Are No Safe Changes
Minimize the Blast Radius on Changes
Monitor Accurately and Measure Thoroughly
Automate Mitigations
Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One
Use Functional Gates Pre-, Post- and During Releases
Design to Meet SLAs and Mitigate Incidents Quickly
Regularly Exercise All Processes and Tools
Enforce Processes with Technology
Redirect or Drop Traffic Aggressively During Incidents
Production Quality Tools
Sanitize and Verify Inputs
Understand All of the Scenarios You Support
Transition Service Responsibilities Carefully
Taught by
USENIX
Related Courses
Cloud Computing Engineering and Management - University System of Maryland via edX
Customer Service Fundamentals - IBM via Coursera
Gathering Non-functional Requirements for Microsoft Azure - Pluralsight
CCSK Cert Prep: 1 Cloud Architecture - LinkedIn Learning
Exam Prep: Microsoft Azure Fundamentals (AZ-900) - LinkedIn Learning