Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Offered By: Databricks via YouTube
Course Description
Overview
Explore an approach to efficient ACID table upserts in data lakehouses in this 35-minute conference talk. Learn how partial copy-on-write is implemented within Apache Parquet, using row-level indexing to improve upsert performance. See how the technique serves use cases such as the GDPR Right to be Forgotten and Change Data Capture, and how it addresses upsert limitations in existing solutions such as Delta Lake, Apache Iceberg, and Apache Hudi. Understand how skipping unnecessary column chunks yields upserts up to 20x faster than conventional methods. The talk is presented by Mingmin Chen, Director of Engineering, and Xinli Shang, Engineering Manager, both of Uber Technologies, Inc.
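The idea of rewriting only the affected parts of a file can be illustrated with a toy model. The sketch below is not the speakers' implementation and does not use any Parquet library: chunks are plain Python lists, and the names `build_row_index` and `partial_copy_on_write` are hypothetical. It assumes row ids are consecutive integers starting at 0 and that every updated row id exists in the file.

```python
from bisect import bisect_right


def build_row_index(chunks):
    """Return the first row id stored in each chunk (rows are consecutive)."""
    starts, next_row = [], 0
    for chunk in chunks:
        starts.append(next_row)
        next_row += len(chunk)
    return starts


def partial_copy_on_write(chunks, updates):
    """Produce a new file version, rewriting only chunks that hold updated rows.

    chunks  : list of lists of values (toy stand-in for encoded chunks)
    updates : dict mapping row id -> new value
    Returns (new_chunks, number_of_rewritten_chunks).
    """
    starts = build_row_index(chunks)

    # Use the row index to group updates by the chunk that contains each row.
    touched = {}
    for row_id, value in updates.items():
        chunk_no = bisect_right(starts, row_id) - 1
        touched.setdefault(chunk_no, {})[row_id - starts[chunk_no]] = value

    new_chunks = []
    for chunk_no, chunk in enumerate(chunks):
        if chunk_no in touched:
            rewritten = list(chunk)  # stands in for decode, patch, re-encode
            for offset, value in touched[chunk_no].items():
                rewritten[offset] = value
            new_chunks.append(rewritten)
        else:
            new_chunks.append(chunk)  # untouched chunk is carried over as-is
    return new_chunks, len(touched)


chunks = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
new_chunks, rewritten = partial_copy_on_write(chunks, {4: 99})
print(new_chunks, rewritten)
```

In this model, updating row 4 rewrites one chunk out of three; the other two are reused without being decoded, and the original version is left unmodified. A conventional copy-on-write would instead rewrite every chunk in the file.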
Syllabus
Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Taught by
Databricks
Related Courses
Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet
EuroPython Conference via YouTube
Building InfluxDB 3.0 with Apache Arrow, DataFusion, Flight and Parquet
Data Council via YouTube
Ten Years of Building Open Source Standards in Data Engineering
Data Council via YouTube
Time Series Analytics with Apache Arrow, Pandas, and Parquet - A 101 Introduction
Data Council via YouTube
Ten Years of Building Open Source Standards: From Parquet to Arrow to OpenLineage
Data Council via YouTube