Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet
Offered By: EuroPython Conference via YouTube
Course Description
Overview
Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary and columnar structure, as well as its CPU and I/O optimization techniques. Learn how to leverage row groups, compression, and dictionary encoding to enhance data storage and retrieval. Discover methods for reading Parquet files into Pandas DataFrames using fastparquet and Apache Arrow libraries. Gain insights into working with data larger than memory or local disk space using Apache Dask, including partitioning and cloud object storage systems like Amazon S3 and Azure Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering to boost query performance on extensive datasets. Understand the benefits of partitioning, row group skipping, and optimal data layout for accelerating queries on large-scale data.
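As a rough illustration of the row group skipping and data layout ideas mentioned above, the sketch below models row groups in plain Python. It is a toy model under stated assumptions, not how fastparquet or Apache Arrow implement it: `RowGroup`, `split_into_row_groups`, and `scan_greater_than` are hypothetical names, and the min/max values stand in for the column statistics that Parquet stores per row group. It shows why a sorted layout lets a reader skip most row groups for a range filter, while a scattered layout forces it to read all of them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group holding one integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        # Plays the role of the "min" column statistic kept in file metadata.
        return min(self.values)

    @property
    def max_value(self) -> int:
        # Plays the role of the "max" column statistic kept in file metadata.
        return max(self.values)


def split_into_row_groups(values: List[int], rows_per_group: int) -> List[RowGroup]:
    """Cut a column into consecutive fixed-size row groups."""
    return [
        RowGroup(values[i:i + rows_per_group])
        for i in range(0, len(values), rows_per_group)
    ]


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values above threshold and how many row groups were read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            # Statistics prove no row can match, so the group is skipped unread.
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Sorted layout: each row group covers a narrow, non-overlapping value range.
sorted_groups = split_into_row_groups(list(range(100)), rows_per_group=25)
sorted_matches, sorted_groups_read = scan_greater_than(sorted_groups, threshold=89)

# Scattered layout: the same 100 values permuted so every group spans a wide range.
scattered_values = [(i * 37) % 100 for i in range(100)]
scattered_groups = split_into_row_groups(scattered_values, rows_per_group=25)
scattered_matches, scattered_groups_read = scan_greater_than(scattered_groups, threshold=89)

print("sorted layout, row groups read:", sorted_groups_read)
print("scattered layout, row groups read:", scattered_groups_read)
```

Both scans return the same rows; only the number of row groups touched differs, which is the motivation for sorting or partitioning data on the columns that queries filter on most often.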
Syllabus
Intro
Outline
Business Model
Data Flow
Conclusion
Why do I care
Other technologies
Blob storage
Data sharing
Pocky
Why Parquet
Python implementations
Parquet file structure
Predicate pushdown
Dictionary encoding
Compression
Partitioning
Storage
ODBC
Azure Blob Storage
Questions
Taught by
EuroPython Conference
Related Courses
Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
Databricks via YouTube
Building InfluxDB 3.0 with Apache Arrow, DataFusion, Flight and Parquet
Data Council via YouTube
Ten Years of Building Open Source Standards in Data Engineering
Data Council via YouTube
Time Series Analytics with Apache Arrow, Pandas, and Parquet - A 101 Introduction
Data Council via YouTube
Ten Years of Building Open Source Standards: From Parquet to Arrow to OpenLineage
Data Council via YouTube