Small Big Data - Using NumPy and Pandas When Your Data Doesn't Fit in Memory
Offered By: PyCon US via YouTube
Course Description
Overview
Learn techniques for handling datasets that are too large to fit in memory yet too small to justify a Big Data cluster in this 26-minute PyCon US talk. Discover how to process Small Big Data efficiently using NumPy and Pandas through money-saving strategies, compression techniques, batching methods, and indexing approaches. Explore practical solutions such as NumPy dtypes, sparse arrays, and Pandas dtypes for compression, chunking with Zarr and Pandas, and indexing with SQLite, each sketched in code below. Gain insights that carry over to other libraries and specific data scenarios, empowering you to tackle data processing challenges effectively.
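For example, here is a minimal sketch of the dtype-based compression idea. It is not taken from the talk itself; the CSV file name and column names are illustrative assumptions. Narrower NumPy dtypes halve or quarter per-value memory cost, and Pandas can be told the compact dtypes at load time so the wide defaults never materialize:

import numpy as np
import pandas as pd

# float64 costs 8 bytes per value; float32 halves that.
values = np.arange(1_000_000, dtype=np.float64)
print(values.nbytes)                      # 8000000
print(values.astype(np.float32).nbytes)   # 4000000

# Specify compact dtypes at load time so the full-width
# representation is never built in memory.
df = pd.read_csv("measurements.csv",
                 dtype={"sensor_id": "int32",
                        "reading": "float32",
                        "city": "category"})
print(df.memory_usage(deep=True))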
Syllabus
Small Big Data
Prelude: the most important question
TIME FOR A BIG DATA CLUSTER!!!!
A non-solution: don't use RAM, just disk
The software solution: use less RAM
Compression: Numpy dtypes
Compression: sparse arrays
Compression: Pandas dtypes (specify types when loading data)
Chunking: loading Numpy chunks with Zarr
Chunking: with Pandas
Indexing: the simplest solution
Indexing: Pandas without indexing
Indexing: populate SQLite from Pandas
Indexing: load from SQLite into DataFrame
Indexing: SQLite vs. CSV
Conclusion: what about other libraries?
Conclusion: don't forget about
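The sketches below make the remaining syllabus topics concrete. None are taken from the talk itself; file, table, and column names are illustrative assumptions. First, compression with sparse arrays: when most entries are zero, a sparse representation stores only the non-zero values. This sketch uses SciPy's CSR format (the talk may use a different sparse library), and for illustration it builds the dense array first:

import numpy as np
from scipy import sparse

# A mostly-zero dense array wastes 8 bytes per zero.
dense = np.zeros((10_000, 1_000), dtype=np.float64)
dense[::100, ::10] = 1.0   # only ~0.1% of entries are non-zero

# CSR keeps just the non-zero values plus index bookkeeping.
csr = sparse.csr_matrix(dense)
print(dense.nbytes)   # 80,000,000 bytes
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # a tiny fraction of that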
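Chunking with Zarr: store the array on disk split into chunks, then process one chunk at a time so only a few megabytes are resident at once. A minimal sketch assuming the zarr package's array API; the store path, shape, and chunk size are assumptions:

import numpy as np
import zarr

# Create an on-disk array split into 10,000-row chunks.
z = zarr.open("big_array.zarr", mode="w",
              shape=(1_000_000, 100), chunks=(10_000, 100),
              dtype="float64")
# Stand-in for real data, written one chunk at a time.
z[:10_000] = np.random.default_rng(0).random((10_000, 100))

# Sum the whole array while holding only one chunk in RAM.
total = 0.0
for start in range(0, z.shape[0], 10_000):
    total += z[start:start + 10_000].sum()
print(total)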
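Chunking with Pandas: read_csv can stream a file in fixed-size chunks, so an aggregate can be computed without ever holding the whole table. The file and column names here are assumptions:

import pandas as pd

total = 0.0
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    # Each chunk is an ordinary DataFrame of up to 100,000 rows.
    total += chunk["amount"].sum()
print(total)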
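Indexing with SQLite: load the data into SQLite once, index the lookup column, and pull only the matching rows back into a DataFrame. Unlike a CSV, which must be scanned end to end, the index turns each lookup into a cheap, low-memory query. A minimal sketch; the table and column names are assumptions:

import sqlite3
import pandas as pd

conn = sqlite3.connect("payments.sqlite")

# Populate SQLite from Pandas (done once, possibly in chunks).
df = pd.DataFrame({"user_id": [1, 2, 2, 3],
                   "amount": [10.0, 5.0, 7.5, 3.0]})
df.to_sql("payments", conn, if_exists="replace", index=False)
conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON payments (user_id)")

# Load only the rows for one user into memory.
subset = pd.read_sql_query(
    "SELECT user_id, amount FROM payments WHERE user_id = ?",
    conn, params=(2,))
print(subset)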
Taught by
PyCon US
Related Courses
Computational Investing, Part I - Georgia Institute of Technology via Coursera
Introduction to Machine Learning - Higher School of Economics via Coursera
Mathematics and Python for Data Analysis - Moscow Institute of Physics and Technology via Coursera
Introduction to Python for Data Science - Microsoft via edX
Using Python for Research - Harvard University via edX