Can Applications Recover from fsync Failures?

Offered By: USENIX via YouTube

Tags

USENIX Annual Technical Conference Courses
Linux Courses
Fault Injection Courses

Course Description

Overview

Explore the intricacies of fsync failures and their impact on file systems and data-intensive applications in this USENIX ATC '20 conference talk. Delve into a comprehensive analysis of how ext4, XFS, and Btrfs file systems react to fsync failures, uncovering commonalities and differences in their behavior. Examine the failure-handling strategies employed by popular applications like PostgreSQL, LMDB, LevelDB, SQLite, and Redis, and discover why these approaches fall short in preventing catastrophic outcomes such as data loss and corruption. Learn about the implications of these findings for designing file systems and applications that aim to provide robust durability guarantees. Gain insights into the challenges of achieving true data durability and the potential directions for improvement in this critical area of computer science.

Syllabus

Intro
How does data reach the disk?
fsync is really important
It's hard to get durability correct - Applications find it difficult
fsync can fail - Durability gets harder to get right
Why care about fsync failures? "About a year ago the PostgreSQL community discovered that fsync (on Linux and some BSD systems) may not work the way we always thought it is [sic], with possibly disastrous consequences for data durability/consistency (which is something the PostgreSQL community really values)."
Our work - Systematically understand fsync failures
File System Results
Application Results
Outline
File System Methodology: Fault Injection
File System Methodology: Workloads - Common write patterns in applications; reduced to simplest form
File System Result #1: Clean Pages - Dirty page is marked clean after fsync failure on all three file systems
File System Result #2: Page Content - File systems do not handle fsync errors uniformly; page content depends on file system
File System Result #3: In-memory state - In-memory data structures are not entirely reverted
Applications - Five widely used applications
Applications Results: Overview - Ext4 Ordered Mode
Crash/Restart - Simple strategies fail; crash/restart is incorrect because it recovers wrong data from the page cache; example: PostgreSQL
Applications Results #1: False Failures - False failures indicate failure but actually succeed
Late Error Reporting - All applications susceptible to data loss on ext4 data mode
Btrfs winning?
Applications Results Summary - Simple strategies fail; applications have moved away from retries
Challenges and Directions
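
As a rough illustration of two syllabus points (a dirty page is marked clean after an fsync failure, and applications have moved away from retries), the Python sketch below treats a failed fsync as a durability error instead of calling fsync again. The names durable_write, DurabilityError, and failing_fsync, and the injectable fsync parameter, are hypothetical and not taken from the talk. The injected EIO only imitates a failure at the application level and is not the talk's fault-injection method.

```python
import errno
import os


class DurabilityError(Exception):
    """Raised when the data could not be confirmed durable (hypothetical type)."""


def durable_write(path, data, fsync=os.fsync):
    """Write `data` to `path`; report success only if fsync succeeded.

    `fsync` is injectable so that a failure can be simulated in a test.
    This hook is an assumption made for illustration.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        try:
            fsync(fd)
        except OSError as exc:
            # Do not retry fsync here: if the dirty page was already marked
            # clean, a second fsync could succeed without persisting the data.
            raise DurabilityError(f"fsync failed for {path}") from exc
    finally:
        os.close(fd)


def failing_fsync(fd):
    """Stand-in for a failing fsync (simulated I/O error, not a real fault)."""
    raise OSError(errno.EIO, "injected fsync failure")
```

Calling durable_write(path, b"payload", fsync=failing_fsync) raises DurabilityError. How an application should recover after that point is the open question the talk examines.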


Taught by

USENIX

Related Courses

Amazon DynamoDB - A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service
USENIX via YouTube
Faasm - Lightweight Isolation for Efficient Stateful Serverless Computing
USENIX via YouTube
AC-Key - Adaptive Caching for LSM-based Key-Value Stores
USENIX via YouTube
The Future of the Past - Challenges in Archival Storage
USENIX via YouTube
A Decentralized Blockchain with High Throughput and Fast Confirmation
USENIX via YouTube