Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet
Offered By: EuroPython Conference via YouTube
Course Description
Overview
Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary, columnar structure and its CPU and I/O optimizations. Learn how row groups, compression, and dictionary encoding improve data storage and retrieval. Discover how to read Parquet files into Pandas DataFrames using the fastparquet and Apache Arrow (pyarrow) libraries. Gain insights into working with data larger than memory or local disk using Apache Dask, including partitioning and cloud object storage systems such as Amazon S3 and Azure Blob Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering, and understand how partitioning, row-group skipping, and a well-chosen data layout accelerate queries on large-scale data.
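A minimal sketch of the write-and-read patterns the overview describes, assuming pandas, pyarrow, dask, and (for S3 paths) s3fs are installed; the file paths, bucket name, and column names (user_id, year, value) are hypothetical illustrations, not taken from the talk:

```python
import pandas as pd
import dask.dataframe as dd

# Write a DataFrame as a partitioned, Snappy-compressed Parquet dataset.
# pyarrow applies dictionary encoding to suitable columns by default.
df = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "year": [2017, 2017, 2018, 2018],
    "value": [0.5, 1.2, 3.4, 2.2],
})
df.to_parquet(
    "events/",                  # hypothetical local output directory
    engine="pyarrow",
    compression="snappy",
    partition_cols=["year"],    # one subdirectory per year value
)

# Read it back into Pandas; columns= touches only the listed columns
# on disk, exploiting Parquet's columnar layout.
small = pd.read_parquet("events/", engine="pyarrow",
                        columns=["user_id", "value"])

# For data larger than memory, Dask reads the partitioned dataset lazily.
# The same call works against cloud object stores such as Amazon S3
# (e.g. "s3://my-bucket/events/", via s3fs). The filters= argument is
# predicate pushdown: partitions and row groups whose statistics cannot
# match the predicate are skipped entirely.
ddf = dd.read_parquet("events/", filters=[("year", "==", 2018)])
result = ddf.groupby("user_id")["value"].sum().compute()
print(result)
```

Selecting columns up front and pushing filters down before calling .compute() is what lets a query read only the row groups and partitions it actually needs.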
Syllabus
Intro
Outline
Business Model
Data Flow
Conclusion
Why do I care
Other technologies
Blob storage
Data sharing
Parquet
Why Parquet
Python implementations
Parquet file structure
Predicate pushdown
Dictionary encoding
Compression
Partitioning
Storage
ODBC
Azure Blob Storage
Questions
Taught by
EuroPython Conference
Related Courses
A Brief History of Data Storage
EuroPython Conference via YouTube
Breaking the Stereotype - Evolution & Persistence of Gender Bias in Tech
EuroPython Conference via YouTube
We Can Get More from Spatial, GIS, and Public Domain Datasets
EuroPython Conference via YouTube
Using NLP to Detect Knots in Protein Structures
EuroPython Conference via YouTube
The Challenges of Doing Infra-As-Code Without "The Cloud"
EuroPython Conference via YouTube