YoVDO

Metastable Failures in the Wild

Offered By: USENIX via YouTube

Tags

OSDI (Operating Systems Design and Implementation) Courses
Distributed Systems Courses
Reliability Engineering Courses
Vulnerability Assessment Courses
Incident Analysis Courses

Course Description

Overview

Explore a comprehensive analysis of metastable failures in distributed systems through this 16-minute conference talk from OSDI '22. Delve into the prevalence and impact of these failures across various organizations, from small companies to hyperscalers. Discover the extended model of metastable failures, including two types of triggers and amplification mechanisms. Learn about real-world examples and their implications for system design and reliability. Gain insights into the recurring patterns of metastable failures in major outages and understand their significance in the field of distributed systems. Examine the researchers' findings from studying 22 metastable failures across 11 different organizations, and explore their reproduced examples in controlled environments. Enhance your understanding of this critical issue in distributed systems and its potential solutions.

Syllabus

Intro
What are Metastable Failures?
Metastable Failures are Prevalent
Metastability in the Wild - Survey
Defining Metastability - System States
Survey Summary
Metastability Taxonomy - Trigger
Metastability Taxonomy - Sustaining effect
Four Metastability Scenarios - Load-spike trigger
Degrees of Vulnerabilities
Lessons
Conclusion


Taught by

USENIX

Related Courses

Failure Analysis And Prevention
Indian Institute of Technology Roorkee via Swayam
Reliable Cloud Infrastructure: Design and Process en Français
Google Cloud via Coursera
Reliability in Engineering Design
Purdue University via edX
Reliable Google Cloud Infrastructure: Design and Process
Pluralsight