YoVDO

How We Managed a Massive-scale Incident at Datadog

Offered By: USENIX via YouTube

Tags

Incident Management Courses
Datadog Courses
Disaster Recovery Courses

Course Description

Overview

Explore a detailed account of Datadog's massive global outage on March 8, 2023, in this 40-minute conference talk from SREcon23 Europe/Middle East/Africa. Learn about the incident's trigger, the extensive recovery efforts, and the technical challenges faced during the crisis. Gain insights into how Datadog successfully coordinated over 500 engineers for more than two days of continuous incident response. Discover the technical lessons learned, innovative solutions implemented, and the organizational strategies that enabled such a large-scale response with minimal heroism. Understand how to build and prepare an engineering team capable of handling major incidents effectively.

Syllabus

SREcon23 Europe/Middle East/Africa - The World Blew Up but We’re All Okay: How We Managed a...


Taught by

USENIX

Related Courses

AWS Certified Security - Specialty 2020
A Cloud Guru
Google Cloud DevOps and SREs (GCP DevOps Engineer Track Part 2)
A Cloud Guru
Aplicación del conector ServiceNow (Español LATAM) | ServiceNow Connector Application (LATAM Spanish)
Amazon Web Services via AWS Skill Builder
Aplicación del conector ServiceNow (Español LATAM) | ServiceNow Connector Application (Spanish from Latin America)
Amazon Web Services via AWS Skill Builder