YoVDO

Sub-Region Failure - How to Handle the Partial Loss of a Data Center

Offered By: USENIX via YouTube

Tags

LISA (Large Installation System Administration) Conference Courses
Disaster Recovery Courses
Data Center Management Courses

Course Description

Overview

Explore how Facebook handles partial data center failures in this LISA19 conference talk. Learn about the Sub-Region Disaster Recovery initiative, which aims to keep data centers online during localized physical failures. Discover the development of an "auditor" that simulates power outages and understand the challenges of managing stateless, stateful, and storage systems during partial failures. Gain insights into testing methodologies, including intentional machine disconnections, and hear real-world stories about accidental power disruptions. Examine the impact of various failure types, from submarine cable disconnections to localized issues like power breaker failures and cooling system malfunctions. Understand the complexities of maintaining service availability in large-scale, geo-distributed data center environments and the strategies employed to minimize the impact of partial failures on overall operations.
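To make the "auditor" idea above more concrete, here is a minimal sketch of what simulating the loss of a fault domain could look like. It is not the tool presented in the talk. The placement data, the service names, the fault domain labels (`fd-a`, `fd-b`, `fd-c`), the capacity thresholds, and the `audit` function are all hypothetical, chosen only to illustrate the question "which services will be impacted if this fault domain loses power?"

```python
from collections import defaultdict

# Hypothetical placement data: (service, fault_domain, healthy_instances).
PLACEMENT = [
    ("web", "fd-a", 40),
    ("web", "fd-b", 40),
    ("web", "fd-c", 40),
    ("cache", "fd-a", 30),
    ("cache", "fd-b", 10),
    ("db-primary", "fd-c", 1),
]

# Hypothetical minimum instance counts each service needs to keep serving.
REQUIRED = {"web": 70, "cache": 20, "db-primary": 1}


def audit(placement, required):
    """Simulate losing each fault domain in turn.

    Returns a dict mapping every fault domain to the sorted list of
    services that would fall below their required instance count.
    """
    totals = defaultdict(int)
    per_domain = defaultdict(lambda: defaultdict(int))
    for service, domain, count in placement:
        totals[service] += count
        per_domain[domain][service] += count

    report = {}
    for domain in sorted(per_domain):
        lost = per_domain[domain]
        report[domain] = sorted(
            service
            for service, need in required.items()
            if totals[service] - lost.get(service, 0) < need
        )
    return report


if __name__ == "__main__":
    for domain, impacted in audit(PLACEMENT, REQUIRED).items():
        print(domain, "->", impacted or "no service below threshold")
```

In this toy data set, the evenly spread stateless `web` tier survives the loss of any single fault domain, while the unevenly placed `cache` tier and the single-instance `db-primary` do not. This mirrors the contrast the talk draws between stateless services and stateful or storage systems.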

Syllabus

Introduction
Hurricane Sandy
The wakeup call
The life of the request
Edge points of presence
Origin regions
Draining regions
Data center failures
Subregion failures
A few thousand servers lost power
The switchboard
Power panels
Fault domain
Drain region
What did it take out
The core problem
Which services will be impacted
Types of subregion failures
Single fault domain
Problem statement
Easy services
Constraints
Not everything is gravy
What happened next
Power Loss Siren
Power Failure
What I learned
Acknowledgement


Taught by

USENIX

Related Courses

Named Data Networking
USENIX via YouTube
Release Engineering Best Practices at Google
USENIX via YouTube
Efficiently Backing Up Terabytes of Data with PgBackRest
USENIX via YouTube
SRE in the Small and in the Large
USENIX via YouTube
Network-Based LUKS Volume Decryption with Tang
USENIX via YouTube