Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet

Offered By: EuroPython Conference via YouTube

Tags

EuroPython Courses, Data Analysis Courses, pandas Courses, Dask Courses, Apache Parquet Courses

Course Description

Overview

Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary and columnar structure, as well as its CPU and I/O optimization techniques. Learn how to leverage row groups, compression, and dictionary encoding to enhance data storage and retrieval. Discover methods for reading Parquet files into Pandas DataFrames using fastparquet and Apache Arrow libraries. Gain insights into working with data larger than memory or local disk space using Apache Dask, including partitioning and cloud object storage systems like Amazon S3 and Azure Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering to boost query performance on extensive datasets. Understand the benefits of partitioning, row group skipping, and optimal data layout for accelerating queries on large-scale data.
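The reading techniques the talk covers map onto a few lines of code. Below is a minimal sketch of the reading side, assuming pandas and Dask are installed with the pyarrow engine (s3fs would also be needed for the S3 path); the file names, bucket, and column names are hypothetical, chosen only for illustration:

import pandas as pd
import dask.dataframe as dd

# Read a single Parquet file into a pandas DataFrame; the engine can
# be "pyarrow" (Apache Arrow) or "fastparquet".
df = pd.read_parquet("events.parquet", engine="pyarrow")

# Columnar layout: reading only the columns you need avoids
# deserializing the rest of the file.
df_small = pd.read_parquet("events.parquet", columns=["user_id", "amount"])

# For data larger than memory, Dask reads a partitioned dataset
# lazily. The filter is pushed down so that row groups whose column
# statistics cannot match are skipped entirely.
ddf = dd.read_parquet(
    "s3://example-bucket/events/",     # hypothetical S3 location
    engine="pyarrow",
    filters=[("year", "==", 2018)],
)
print(ddf.groupby("user_id")["amount"].sum().compute())

Selecting only the needed columns and filtering on partition columns or row-group statistics is what turns Parquet's metadata into the query speed-ups the talk describes.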

Syllabus

Intro
Outline
Business Model
Data Flow
Conclusion
Why do I care
Other technologies
Blob storage
Data sharing
Parquet
Why Parquet
Python implementations
Parquet file structure
Predicate pushdown
Dictionary encoding
Compression
Partitioning
Storage
ODBC
Azure Blob Storage
Questions
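
As a companion to the dictionary encoding, compression, and partitioning topics listed above, here is a minimal sketch of the writing side, again assuming pandas with the pyarrow engine; the DataFrame contents and partition column are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "country": ["DE", "FR", "DE", "FR"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Hive-style partitioning writes one subdirectory per distinct `year`
# value (events/year=2018/...), so later reads that filter on `year`
# never open the other partitions. Snappy compression is applied per
# column chunk, and pyarrow dictionary-encodes low-cardinality
# columns such as `country` by default.
df.to_parquet(
    "events/",
    engine="pyarrow",
    partition_cols=["year"],
    compression="snappy",
)

Choosing the partition column to match the most common query predicate is the layout decision the talk highlights for accelerating queries on large datasets.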


Taught by

EuroPython Conference

Related Courses

Parallel Programming with Dask in Python
DataCamp
Scaling Python Data Applications with Dask
Pluralsight
Trabajando con Dask
Coursera Project Network via Coursera
Faster pandas
LinkedIn Learning