YoVDO

Turning an Incident Report into a Design Issue with TLA+

Offered By: USENIX via YouTube

Tags

SREcon Courses Concurrency Courses Incident Analysis Courses TLA+ Courses

Course Description

Overview

Explore how modeling-driven techniques can enhance postmortem analysis of high-impact outages in this 31-minute conference talk from SREcon23 Americas. Learn about the application of TLA+, a formal specification language, to precisely describe and analyze the behavior of micro-service architectures under concurrency. Discover how building a specification of Microsoft's CosmosDB helped identify the root cause of a long-lasting outage, uncovered bugs in documentation, and provided insights into underlying design issues. Gain valuable knowledge on using TLA+ to create precise and reusable specifications, potentially preventing similar incidents in the future and improving overall system reliability.

Syllabus

SREcon23 Americas - Turning an Incident Report into a Design Issue with TLA+


Taught by

USENIX

Related Courses

How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube