Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet

Offered By: EuroPython Conference via YouTube

Course Description

Overview


Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary and columnar structure, as well as its CPU and I/O optimization techniques. Learn how to leverage row groups, compression, and dictionary encoding to enhance data storage and retrieval. Discover methods for reading Parquet files into Pandas DataFrames using fastparquet and Apache Arrow libraries. Gain insights into working with data larger than memory or local disk space using Apache Dask, including partitioning and cloud object storage systems like Amazon S3 and Azure Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering to boost query performance on extensive datasets. Understand the benefits of partitioning, row group skipping, and optimal data layout for accelerating queries on large-scale data.
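To make two of the ideas above concrete, here is a minimal toy sketch of row group skipping via min/max column statistics and of dictionary encoding. It is not the Parquet format itself, nor the API of fastparquet, Apache Arrow, or Dask. All names in it (`RowGroup`, `make_row_groups`, `scan_greater_equal`, `dictionary_encode`) are hypothetical and chosen only for illustration, and it uses nothing beyond the Python standard library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group: a chunk of values plus min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_row_groups(values: List[int], row_group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size chunks and record statistics per chunk."""
    groups = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_equal(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return values >= threshold and the number of row groups skipped.

    A group whose maximum is below the threshold cannot contain a match,
    so its values are never inspected.
    """
    matches: List[int] = []
    skipped = 0
    for group in groups:
        if group.max_value < threshold:
            skipped += 1
            continue
        matches.extend(v for v in group.values if v >= threshold)
    return matches, skipped


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace repeated values with small integer indices into a dictionary."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Sorted data gives each row group a narrow, non-overlapping min/max range,
# which is what makes skipping effective.
timestamps = list(range(100))
groups = make_row_groups(timestamps, row_group_size=25)
matches, skipped = scan_greater_equal(groups, threshold=80)
print(len(matches), skipped)

dictionary, indices = dictionary_encode(["DE", "FR", "DE", "DE", "FR"])
print(dictionary, indices)
```

In this toy model, sorting the column before it is chunked is what lets the statistics rule out whole row groups. On unsorted data most groups would span the full value range and few could be skipped.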

Syllabus

Intro
Outline
Business Model
Data Flow
Conclusion
Why do I care
Other technologies
Blob storage
Data sharing
Pocky
Why Parquet
Python implementations
Parquet file structure
Predicate pushdown
Dictionary encoding
Compression
Partitioning
Storage
ODBC
Azure Blob Storage
Questions

Taught by

EuroPython Conference
