How We Managed a Massive-scale Incident at Datadog
Offered By: USENIX via YouTube
Course Description
Overview
Explore a detailed account of Datadog's massive global outage on March 8, 2023, in this 40-minute conference talk from SREcon23 Europe/Middle East/Africa. Learn about the incident's trigger, the extensive recovery efforts, and the technical challenges faced during the crisis. Gain insights into how Datadog successfully coordinated over 500 engineers for more than two days of continuous incident response. Discover the technical lessons learned, innovative solutions implemented, and the organizational strategies that enabled such a large-scale response with minimal heroism. Understand how to build and prepare an engineering team capable of handling major incidents effectively.
Syllabus
SREcon23 Europe/Middle East/Africa - The World Blew Up but We’re All Okay: How We Managed a...
Taught by
USENIX
Related Courses
Emergency Management
Open2Study
Resilience in Children Exposed to Trauma, Disaster and War: Global Perspectives
University of Minnesota via Coursera
MongoDB Advanced Deployment and Operations
MongoDB University
Arch403: Designing Resilient Schools
Build Academy via EdCast
Bases de données relationnelles : Comprendre pour maîtriser
Inria (French Institute for Research in Computer Science and Automation) via France Université Numerique