Are We All on the Same Page? Let's Fix That

Offered By: USENIX via YouTube

Tags

SREcon Courses, Distributed Systems Courses, Observability Courses

Course Description

Overview

Explore a 39-minute conference talk from SREcon19 Europe/Middle East/Africa on managing alerts in complex distributed systems. Learn about Adaptive Paging, an alert handler that uses tracing and OpenTracing's semantic conventions to identify the most probable cause of an issue and page the team responsible for it. Discover how this approach can reduce alert fatigue and improve incident response in organizations with multiple teams. The talk covers the evolution from monoliths to distributed systems, the changing roles in DevOps, and the implementation of Adaptive Paging. Gain insights into the challenges of observability and see real-world examples of how the approach applies during outages, including partial network failures.
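
To give a feel for the idea before watching, here is a minimal sketch of how an alert handler might walk a trace to pick a team to page. It is not the speakers' implementation. It assumes spans carry the standard OpenTracing `error` and `component` tags, and that the probable cause is the deepest errored span reachable from an errored root. The `Span` class, the `probable_cause` and `team_to_page` functions, the ownership map, and every service and team name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span: only the fields this sketch needs."""
    span_id: str
    parent_id: Optional[str]
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)


def probable_cause(spans: List[Span]) -> Optional[Span]:
    """Follow errored children down from an errored root span.

    Assumption: the deepest span tagged error=True is the most probable cause.
    Returns None when no root span is tagged as an error.
    """
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    roots = [s for s in children.get(None, []) if s.tags.get("error") is True]
    if not roots:
        return None

    current = roots[0]
    seen = {current.span_id}  # guard against malformed, cyclic traces
    while True:
        failing = [
            c for c in children.get(current.span_id, [])
            if c.tags.get("error") is True and c.span_id not in seen
        ]
        if not failing:
            return current
        current = failing[0]
        seen.add(current.span_id)


def team_to_page(spans: List[Span], owners: Dict[str, str], default_team: str) -> str:
    """Map the probable cause's `component` tag to an owning team."""
    cause = probable_cause(spans)
    if cause is None:
        return default_team
    return owners.get(str(cause.tags.get("component")), default_team)


# Hypothetical trace: the frontend request fails because a database call fails.
EXAMPLE_TRACE = [
    Span("1", None, "GET /checkout", {"component": "frontend", "error": True}),
    Span("2", "1", "charge", {"component": "payments", "error": True}),
    Span("3", "2", "SELECT", {"component": "payments-db", "error": True}),
    Span("4", "1", "render", {"component": "frontend"}),
]
# Hypothetical ownership map from component name to on-call team.
OWNERS = {
    "frontend": "web-team",
    "payments": "payments-team",
    "payments-db": "storage-team",
}
```

With this example data, `team_to_page(EXAMPLE_TRACE, OWNERS, "sre-oncall")` selects the team that owns the database span rather than the team whose frontend alert fired, which is the kind of routing the talk describes.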

Syllabus

Introduction
Monoliths
Ops dev silos
New roles
The solution
Adaptive Paging
Alert Handler Example
Outage Example
Challenges
Observability
Questions
Network Partial Outage


Taught by

USENIX

Related Courses

How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube