YoVDO

Anatomy of Reading Apache Parquet Files in Apache Impala

Offered By: The ASF via YouTube

Tags

Big Data Courses, C++ Courses, Apache Parquet Courses

Course Description

Overview

Explore how Apache Impala reads Apache Parquet files in this 26-minute conference talk. The talk covers the crucial early stages of query execution in Impala, from reading the bytes of Parquet files on the filesystem to applying predicates and runtime filters on individual rows. Apache Impala is a distributed, massively parallel analytic query engine optimized for both object stores and on-premises distributed file systems. Discover why Impala uses its own C++ Parquet scanner instead of existing libraries, a choice that enables features such as data caching, execution within memory bounds, and efficient parallelism, and how these features give Impala an edge among Big Data query engines. The talk is presented by Csaba Ringhofer and Daniel Becker, experienced software engineers from Cloudera and members of the Apache Impala PMC, and is aimed at those working with big data systems and file formats.

Syllabus

Anatomy of reading Apache Parquet files (from the Apache Impala perspective)


Taught by

The ASF

Related Courses

Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet
EuroPython Conference via YouTube
Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Databricks via YouTube
Building InfluxDB 3.0 with Apache Arrow, DataFusion, Flight and Parquet
Data Council via YouTube
Ten Years of Building Open Source Standards in Data Engineering
Data Council via YouTube
Time Series Analytics with Apache Arrow, Pandas, and Parquet - A 101 Introduction
Data Council via YouTube