Recent Parquet Improvements in Apache Spark - Vectorized Complex Types and Column Index Support

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Parquet Courses

Course Description

Overview

Explore recent improvements to Apache Parquet read performance in Apache Spark in this 37-minute talk from Databricks. Learn about vectorized read support for complex types, which can deliver a speedup of 10x or more when reading Parquet data with complex (nested) structures. Discover how Parquet column index support strengthens predicate pushdown and how Spark uses it to filter data more efficiently. The talk also compares the vectorized and non-vectorized Parquet readers, explains why predicate pushdown matters for scan performance, and previews future work on Parquet read performance in Spark. Along the way it covers Parquet schema conversion, complex type support, and column index filtering.
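To make the idea of column index filtering more concrete, here is a small toy model in plain Python. It is only a sketch and not Spark's or Parquet's actual implementation: the names `Page`, `PageStats`, `build_column_index`, `pages_to_read`, and `scan` are hypothetical, the values are made up for illustration, and every page is assumed to be non-empty. It shows the general principle that per-page minimum and maximum values let a reader skip pages that cannot match a range predicate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """Toy stand-in for one data page of a single integer column."""
    values: List[int]


@dataclass
class PageStats:
    """Hypothetical column index entry: min and max value of one page."""
    min_value: int
    max_value: int


def build_column_index(pages: List[Page]) -> List[PageStats]:
    """Collect per-page min/max statistics (assumes non-empty pages)."""
    return [PageStats(min(p.values), max(p.values)) for p in pages]


def pages_to_read(index: List[PageStats], lower: int, upper: int) -> List[int]:
    """Return positions of pages whose [min, max] overlaps [lower, upper]."""
    return [
        i
        for i, stats in enumerate(index)
        if stats.max_value >= lower and stats.min_value <= upper
    ]


def scan(
    pages: List[Page], index: List[PageStats], lower: int, upper: int
) -> Tuple[List[int], List[int]]:
    """Read only the candidate pages, then apply the predicate row by row."""
    selected = pages_to_read(index, lower, upper)
    rows = [
        v
        for i in selected
        for v in pages[i].values
        if lower <= v <= upper
    ]
    return selected, rows


# Illustrative data: four pages of a column whose values happen to be sorted.
pages = [
    Page([1, 3, 5]),
    Page([10, 12, 18]),
    Page([20, 25, 31]),
    Page([40, 41, 47]),
]
index = build_column_index(pages)

# Predicate: 11 <= value <= 22. Pages 0 and 3 are skipped without being read.
selected, rows = scan(pages, index, 11, 22)
print(selected)  # [1, 2]
print(rows)      # [12, 18, 20]
```

In this toy model, skipping works well because the values are sorted, so page ranges do not overlap. With unsorted data, most pages would likely overlap the predicate range and fewer could be skipped.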

Syllabus

Intro
Short Intro
Outline
Introduction on Apache Parquet
Parquet: Glossary
Parquet: Data Page
Background
Non-Vectorized Parquet Reader
Advantages of Vectorized Approach
High Level Idea
Parquet Schema Conversion
SPARK-34863: Complex type support
Complex Type - Performance
Perf: vectorized vs non-vectorized
Parquet Predicate Pushdown
Column Index Filtering
Column Index Support in Spark
Column Index - Performance
Future Work


Taught by

Databricks

Related Courses

Python for Data Science Tips, Tricks, & Techniques
LinkedIn Learning
Sound Data Engineering in Rust - From Bits to DataFrames
Databricks via YouTube
Optimizing Spark SQL Jobs with Parallel and Asynchronous IO
Databricks via YouTube
Degrading Performance - Understanding and Solving Small Files Syndrome
Databricks via YouTube
The Apache Spark File Format Ecosystem - Optimizing Storage for Performance
Databricks via YouTube