YoVDO

The Parquet Format and Performance Optimization Opportunities

Offered By: Databricks via YouTube

Tags

Parquet Courses, Apache Spark Courses, File Management Courses, Data Analytics Courses, Delta Lake Courses, Columnar Storage Courses

Course Description

Overview

Dive into the intricacies of the Parquet format and explore performance optimization opportunities in this 41-minute conference talk by Boudewijn Braams from Databricks. Begin with an introduction to structured data formats and physical data storage models, including row-wise, columnar, and hybrid approaches. Delve deeper into the specifics of the Parquet format, examining its disk representation, physical data organization, and encoding schemes. Learn about various performance optimization techniques such as dictionary encoding, page compression, predicate pushdown, dictionary filtering, and partitioning schemes. Discover strategies to combat the issue of 'many small files' and gain insights into the open-source Delta Lake format in relation to Parquet. Suitable for both newcomers seeking an approachable refresher on columnar storage and experienced professionals looking to optimize analytical workloads in Spark, this talk provides tangible tips and tricks to leverage the Parquet format for improved performance.

Syllabus

Intro
Data processing and analytics
Overview
Data sources and formats
Physical storage layout models
Different workloads
Row-wise vs Columnar
Parquet: data organization
Parquet: encoding schemes
Optimization: dictionary encoding
Optimization: predicate pushdown
Optimization: partitioning (embed predicates in directory structure)
Optimization: avoid many small files
Optimization: avoid few huge files
Optimization: Delta Lake (open-source storage layer on top of Parquet in Spark)
Conclusion
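
As a rough illustration of the partitioning item in the syllabus above, the sketch below shows how key=value directory names let a reader skip whole files without opening them. It is not taken from the talk: the file paths, the column names (date, country), and the helpers partition_values and prune are all hypothetical, and real engines such as Spark do this pruning internally.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Parse Hive-style `key=value` directory names from a file path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, **equals):
    """Keep only files whose partition directories match every key=value filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in equals.items())
    ]


# Hypothetical layout, partitioned by `date` and then `country`.
files = [
    "events/date=2020-01-01/country=NL/part-0.parquet",
    "events/date=2020-01-01/country=US/part-0.parquet",
    "events/date=2020-01-02/country=NL/part-0.parquet",
]

# An equality predicate on a partition column is answered from paths alone.
selected = prune(files, country="NL")
print(selected)
```

Only the files under country=NL directories remain; the US file is never read, which is the sense in which the predicate is embedded in the directory structure.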


Taught by

Databricks

Related Courses

Distributed Computing with Spark SQL
University of California, Davis via Coursera
Apache Spark (TM) SQL for Data Analysts
Databricks via Coursera
Building Your First ETL Pipeline Using Azure Databricks
Pluralsight
Implement a data lakehouse analytics solution with Azure Databricks
Microsoft via Microsoft Learn
Perform data science with Azure Databricks
Microsoft via Microsoft Learn