YoVDO

Evolution of Incident Management at Slack

Offered By: USENIX via YouTube

Tags

SREcon Courses
Crisis Management Courses
Reliability Engineering Courses
Process Improvement Courses
Incident Management Courses

Course Description

Overview

Explore the evolution of incident management at Slack in this 28-minute conference talk from SREcon21. Discover how the company handles dozens of incidents weekly while delivering over 150 million messages per minute at peak. Learn about Slack's journey to make incident management a core capability of its entire engineering team, including the company's history, its reliability crisis, and its vision for incident management. Gain insights into the incident management plan, training, severity levels, and the role of the Major Incident Commander (Major IC). Understand how Slack manages simultaneous incidents, implements Area Command, and handles long-duration and pillar incidents. Examine ongoing challenges, recruitment and training strategies, and the challenges that success itself brings to incident management.

Syllabus

Intro
History of Slack
Reliability Crisis
Incident Management Vision
Incident Management Plan
Incident Management Training
Severity Levels
Major IC
Major IC On-Call
Major IC Responsibility
Simultaneous Incidents
Area Command
Long Duration Incidents
Pillar Incidents
What's Next
Ongoing Challenges
Recruitment and Training
Challenge of Success


Taught by

USENIX

Related Courses

How to Not Destroy Your Production Kubernetes Clusters
USENIX via YouTube
SRE and ML - Why It Matters
USENIX via YouTube
Knowledge and Power - A Sociotechnical Systems Discussion on the Future of SRE
USENIX via YouTube
Tracing Bare Metal with OpenTelemetry
USENIX via YouTube
Improving How We Observe Our Observability Data - Techniques for SREs
USENIX via YouTube