Why We Built Our Own Distributed Column Store
Offered By: Strange Loop Conference via YouTube
Course Description
Overview
Explore the architecture and implementation of Retriever, a custom-built distributed column store database, in this 43-minute Strange Loop Conference talk. Learn how Honeycomb addressed the challenges of understanding complex distributed systems in production by developing a low-latency, schemaless database inspired by Facebook's Scuba. Discover the design decisions behind Retriever, including its use of disk storage, its efficient column-oriented storage model, and its ability to handle multi-tenancy and cost constraints. Gain insights into the write and read paths, data model, storage format, distributed queries, and fault tolerance mechanisms. Understand how Retriever ingests events from Kafka, manages quotas, and handles failure recovery. Delve into the lessons learned from operating a hand-rolled database at production scale with paying customers, and see how it compares to other solutions for sub-second complex queries over large data volumes in real time.
Syllabus
Intro
Please meet Retriever
Retriever is a special purpose data store
What is Honeycomb?
How Honeycomb works
Honeycomb under the hood
Our requirements
Requirements - summary
Retriever at a glance
Retriever compared to Scuba
Architecture - write path
Architecture - read path
Data model - datasets
Data model - events
Row oriented storage
Column oriented storage
Storage Format - timestamp column
Storage Format - reading
Distributed queries
Distributed reads - calculations
Distributed reads - fanout
Detour - Kafka
Ingestion
Quota management
Fault tolerance
Failure recovery
Bootstrapping new nodes
Taught by
Strange Loop Conference