Metastable Failures in the Wild

Offered By: USENIX via YouTube

Tags

OSDI (Operating Systems Design and Implementation) Courses
Distributed Systems Courses
Reliability Engineering Courses
Vulnerability Assessment Courses
Incident Analysis Courses

Course Description

Overview

Explore an analysis of metastable failures in distributed systems in this 16-minute conference talk from OSDI '22. Learn how prevalent these failures are and what impact they have at organizations ranging from small companies to hyperscalers. Discover the extended model of metastable failures, which includes two types of triggers and amplification mechanisms. See how metastable failures recur as a pattern in major outages, and what real-world examples imply for system design and reliability. Examine the researchers' findings from a study of 22 metastable failures across 11 organizations, along with the examples they reproduced in controlled environments.

Syllabus

Intro
What are Metastable Failures?
Metastable Failures are Prevalent
Metastability in the Wild - Survey
Defining Metastability - System States
Survey Summary
Metastability Taxonomy - Trigger
Metastability Taxonomy - Sustaining Effect
Four Metastability Scenarios - Load-spike trigger
Degrees of Vulnerabilities
Lessons
Conclusion


Taught by

USENIX

Related Courses

GraphX - Graph Processing in a Distributed Dataflow Framework
USENIX via YouTube
Theseus - An Experiment in Operating System Structure and State Management
USENIX via YouTube
RedLeaf - Isolation and Communication in a Safe Operating System
USENIX via YouTube
Microsecond Consensus for Microsecond Applications
USENIX via YouTube
KungFu - Making Training in Distributed Machine Learning Adaptive
USENIX via YouTube