YoVDO

Troubleshooting Tiered Tragedy - A Peek Into Failure

Offered By: GOTO Conferences via YouTube

Tags

GOTO Conferences Courses
Distributed Systems Courses
Failure Analysis Courses
Centralized Logging Courses

Course Description

Overview

Explore a real-world failure in a distributed system and the troubleshooting process involved in this 42-minute conference talk from GOTO Chicago 2017. Follow Jeff Smith, Manager of Production Operations at Centro, as he dissects the anatomy of a system, defines failure modes, and walks through the signs of trouble. Learn about shutting down systems, identifying changes, rolling back, and uncovering missed issues. Gain insights on the importance of sensible defaults, contextual metrics, centralized logging, and alert management. Conclude with a recap and Q&A session to deepen your understanding of handling complex system failures in DevOps environments.

Syllabus

Introduction
Anatomy of a System
System Definition
Failure Modes
Failure Walkthrough
Signs of Trouble
Shutting It Down
What Changed
Roll It Back
What We Missed
A Sensible Default
What We Learned
Metrics Need Context
Centralized Logging
Losing Alerts
Recap
Questions


Taught by

GOTO Conferences

Related Courses

Forensic Engineering: Learning from Failures
Delft University of Technology via edX
Failure Analysis And Prevention
Indian Institute of Technology Roorkee via Swayam
Dynamic Behaviour of Materials
Indian Institute of Technology Guwahati via Swayam
Principles of Metal Forming Technology
Indian Institute of Technology Roorkee via Swayam
Success and Failure in Entrepreneurship: Discover the Key to Business Success
Coventry University via FutureLearn