Evolution of Incident Management at Slack

Offered By: USENIX via YouTube

Course Description

Overview

Explore the evolution of incident management at Slack in this 28-minute conference talk from SREcon21. Discover how the company handles dozens of incidents each week while delivering more than 150 million messages per minute at peak. Learn about Slack's journey to make incident management a core capability of its entire engineering team, including the company's history, its reliability crisis, and its vision for incident management. Gain insights into the incident management plan, training, severity levels, and the role of Major Incident Commanders. Understand how Slack manages simultaneous incidents, implements Area Command, and handles long-duration and pillar incidents. Examine ongoing challenges, recruitment and training strategies, and the impact of success on incident management practices.

Syllabus

Intro
History of Slack
Reliability Crisis
Incident Management Vision
Incident Management Plan
Incident Management Training
Severity Levels
Major IC
Major IC On-Call
Major IC Responsibility
Simultaneous incidents
Area Command
Long Duration Incidents
Pillar Incidents
What's Next
Ongoing Challenges
Recruitment and Training
Challenge of Success

Taught by

USENIX
