YoVDO

Sub-Region Failure - How to Handle the Partial Loss of a Data Center

Offered By: USENIX via YouTube

Tags

LISA (Large Installation System Administration) Conference Courses
Disaster Recovery Courses
Data Center Management Courses

Course Description

Overview

Explore how Facebook handles partial data center failures in this LISA19 conference talk. Learn about the Sub-Region Disaster Recovery initiative, which aims to keep data centers online during localized physical failures. Discover the development of an "auditor" that simulates power outages and understand the challenges of managing stateless, stateful, and storage systems during partial failures. Gain insights into testing methodologies, including intentional machine disconnections, and hear real-world stories about accidental power disruptions. Examine the impact of various failure types, from submarine cable disconnections to localized issues like power breaker failures and cooling system malfunctions. Understand the complexities of maintaining service availability in large-scale, geo-distributed data center environments and the strategies employed to minimize the impact of partial failures on overall operations.

Syllabus

Introduction
Hurricane Sandy
The wakeup call
The life of the request
Edge points of presence
Origin regions
Draining regions
Data center failures
Subregion failures
A few thousand servers lost power
The switchboard
Power panels
Fault domain
Drain region
What did it take out
The core problem
Which services will be impacted
Types of subregion failures
Single fault domain
Problem statement
Easy services
Constraints
Not everything is gravy
What happened next
Power Loss Siren
Power Failure
What I learned
Acknowledgement


Taught by

USENIX

Related Courses

Emergency Management
Open2Study
Resilience in Children Exposed to Trauma, Disaster and War: Global Perspectives
University of Minnesota via Coursera
MongoDB Advanced Deployment and Operations
MongoDB University
Arch403: Designing Resilient Schools
Build Academy via EdCast
Bases de données relationnelles : Comprendre pour maîtriser (Relational Databases: Understanding to Master)
Inria (French Institute for Research in Computer Science and Automation) via France Université Numerique