YoVDO

Employing NumPy's NPY Format for Faster Than Parquet DataFrame Storage

Offered By: PyCon US via YouTube

Tags

PyCon US Courses, Python Courses, NumPy Courses, JSON Courses, DataFrames Courses, Serialization Courses

Course Description

Overview

Explore the potential of NumPy's NPY format as a faster alternative to Parquet for DataFrame storage in this PyCon US talk. Dive into the challenges of serializing DataFrames and learn how a custom NPZ file format with JSON metadata can offer significant performance and compatibility advantages. Examine detailed read/write performance comparisons between Parquet and NPZ across various DataFrame shapes and dtype compositions. Discover techniques for optimizing Python routines for NPY file operations and explore applications for memory-mapping complete DataFrames using NPY representation. Gain insights into improving data science workflows and reducing compute costs through this innovative approach to DataFrame storage.
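
To make the idea of an NPZ file with JSON metadata concrete, here is a minimal sketch, not the format presented in the talk. It assumes an NPZ is treated as a plain ZIP archive holding one NPY member per column array, plus one JSON member for the labels. The member names (`__meta__.json`, `__values_N__.npy`) and both function names are hypothetical, chosen only for illustration.

```python
import io
import json
import zipfile

import numpy as np


def write_npz_with_metadata(columns, index, arrays):
    """Return the bytes of a ZIP archive with a JSON member and one NPY member per array.

    The layout and member names are illustrative assumptions, not a published format.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as zf:
        # Labels are stored as JSON so that no pickling is needed to recover them.
        metadata = {"columns": list(columns), "index": list(index)}
        zf.writestr("__meta__.json", json.dumps(metadata))
        for i, array in enumerate(arrays):
            member = io.BytesIO()
            # allow_pickle=False makes np.save raise on object arrays.
            np.save(member, array, allow_pickle=False)
            zf.writestr(f"__values_{i}__.npy", member.getvalue())
    return buffer.getvalue()


def read_npz_with_metadata(payload):
    """Inverse of write_npz_with_metadata: return (metadata dict, list of arrays)."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        metadata = json.loads(zf.read("__meta__.json"))
        arrays = [
            np.load(io.BytesIO(zf.read(f"__values_{i}__.npy")), allow_pickle=False)
            for i in range(len(metadata["columns"]))
        ]
    return metadata, arrays


payload = write_npz_with_metadata(
    columns=["price", "qty"],
    index=["a", "b", "c"],
    arrays=[np.array([1.5, 2.5, 3.5]), np.array([10, 20, 30], dtype=np.int64)],
)
metadata, arrays = read_npz_with_metadata(payload)
```

Because each column is written as its own NPY member, dtypes survive the round trip exactly. Keeping the labels in JSON avoids pickle, at the cost of supporting only JSON-representable label types in this simplified version.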

Syllabus

Intro
The Quest for Complete DataFrame Serialization
NumPy Enhancement Proposal (NEP) 1
Promising Performance of NPZ versus Parquet
Overview
Components of a DataFrame
Block-Consolidation Strategies: Unconsolidated Blocks
Block Consolidation & Complexity
The NPY Format
Converting Contiguous Bytes to an Array
NPY & Object Arrays
NPY Versions
The NPZ Format
Encoding a DataFrame as an NPZ
JSON Metadata
NPY Performance in NumPy
Lies, Damned Lies, and Benchmarks
Nine DataFrame Fixtures
Memory Maps
Memory Mapping an Array
Memory Mapping a DataFrame
Current State
Future Work
Conclusions
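
As a small illustration of the "Memory Mapping an Array" topic in the syllabus, the sketch below uses NumPy's own `np.save` and `np.load(..., mmap_mode="r")` to open an NPY file as a read-only memory map. The function name and the file name `values.npy` are hypothetical, and this is not the DataFrame-level memory mapping covered in the talk.

```python
import os
import tempfile

import numpy as np


def memory_map_demo():
    """Write a small array to an NPY file, reopen it memory-mapped, and read one row."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.npy")
        np.save(path, np.arange(12, dtype=np.float64).reshape(3, 4))

        # mmap_mode="r" returns a read-only np.memmap; data is paged in on access
        # instead of being read into memory up front.
        mapped = np.load(path, mmap_mode="r")
        is_memmap = isinstance(mapped, np.memmap)
        row_sum = float(mapped[1].sum())  # touches only the second row

        # Release the map before the temporary directory is removed.
        del mapped
    return is_memmap, row_sum


is_memmap, row_sum = memory_map_demo()
```

This works because an NPY file stores a short header followed by the array's raw bytes, so a non-object array can be mapped directly from disk.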


Taught by

PyCon US
