Recent Parquet Improvements in Apache Spark - Vectorized Complex Types and Column Index Support
Offered By: Databricks via YouTube
Course Description
Overview
Explore recent improvements to Apache Parquet read performance in Apache Spark in this 37-minute talk from Databricks. Learn about vectorized read support for complex types, which can deliver a 10x or greater speedup when reading Parquet data with complex structures. Discover how Parquet column index support strengthens predicate pushdown, and how Spark uses it to filter data more efficiently. Gain insight into the differences between the vectorized and non-vectorized Parquet readers, understand why predicate pushdown matters for scan performance, and get a glimpse of future work aimed at further improving Parquet read performance in Spark. The talk also covers technical concepts such as Parquet schema conversion, complex type support, and column index filtering.
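To make the column index idea in the overview concrete, here is a minimal Python sketch of page skipping based on per-page min/max statistics. It is an illustration only, not Parquet's or Spark's actual implementation: the `PageStats` class, the `pages_to_read` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical stand-in for one column index entry: min/max of one data page."""
    page_id: int
    min_value: int
    max_value: int


def pages_to_read(index: List[PageStats], lower: int, upper: int) -> List[int]:
    """Return ids of pages whose [min, max] range overlaps [lower, upper].

    Pages outside the range cannot contain a matching value, so a reader
    could skip them without decoding their contents.
    """
    return [
        p.page_id
        for p in index
        if p.max_value >= lower and p.min_value <= upper
    ]


# Made-up statistics for four data pages of a sorted integer column.
column_index = [
    PageStats(0, 1, 100),
    PageStats(1, 101, 200),
    PageStats(2, 201, 300),
    PageStats(3, 301, 400),
]

# Predicate: value BETWEEN 150 AND 250
selected = pages_to_read(column_index, 150, 250)
print(selected)  # [1, 2]
```

In this sketch only two of the four pages need to be read. The benefit depends on how well the data is clustered: if every page spans the full value range, no page can be skipped.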
Syllabus
Intro
Short Intro
Outline
Introduction on Apache Parquet
Parquet: Glossary
Parquet: Data Page
Background
Non-Vectorized Parquet Reader
Advantages of Vectorized Approach
High Level Idea
Parquet Schema Conversion
SPARK-34863: Complex type support
Complex Type - Performance
Perf: vectorized vs non-vectorized
Parquet Predicate Pushdown
Column Index Filtering
Column Index Support in Spark
Column Index - Performance
Future Work
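For readers who want to try the features named in the syllabus, the sketch below collects the configuration settings that, to the best of our knowledge, control them, and renders them as `spark-submit` arguments. The setting names are assumptions to verify against the documentation for your Spark and Parquet versions. The helper `to_submit_args` is hypothetical and does not require Spark to run.

```python
from typing import Dict, List

# Assumed setting names; availability and defaults vary by Spark / Parquet version.
PARQUET_READ_SETTINGS: Dict[str, str] = {
    # Vectorized Parquet reader.
    "spark.sql.parquet.enableVectorizedReader": "true",
    # Vectorized reads for complex (nested) types, associated with SPARK-34863.
    "spark.sql.parquet.enableNestedColumnVectorizedReader": "true",
    # Predicate pushdown into the Parquet scan.
    "spark.sql.parquet.filterPushdown": "true",
    # Column index filtering in the Parquet library, passed through as a Hadoop setting.
    "spark.hadoop.parquet.filter.columnindex.enabled": "true",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Turn a settings dict into a flat list of `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(PARQUET_READ_SETTINGS)))
```

The same key/value pairs could instead be set on a session builder or in a defaults file. The command-line form is used here only because it keeps the example free of a Spark dependency.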
Taught by
Databricks
Related Courses
CS115x: Advanced Apache Spark for Data Science and Data Engineering
University of California, Berkeley via edX
Big Data Analytics
University of Adelaide via edX
Big Data Essentials: HDFS, MapReduce and Spark RDD
Yandex via Coursera
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
Yandex via Coursera
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera