YoVDO

Large Scale Data Validation - with Spark and Dask

Offered By: PyCon US via YouTube

Tags

PyCon US Courses, pandas Courses, Distributed Computing Courses, Data Validation Courses, Dask Courses

Course Description

Overview

This PyCon US talk covers large-scale data validation with Spark and Dask. Data validation keeps data pipelines reliable and protects the integrity of downstream workflows. The talk contrasts validation on a single machine with validation in a distributed setting and identifies which validations become more computationally expensive on Spark and Dask. It also covers the need to apply different validations to different partitions of the data, and how to do so by combining frameworks. A fictitious case study traces the validation journey from small-scale pandas-based checks with Pandera and Great Expectations to distributed settings, including the challenges of that transition and how to reuse pandas-based validations on Spark and Dask with Fugue. The talk runs 26 minutes and is aimed at data scientists and engineers dealing with growing data volumes and complex validation requirements.
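
The idea of applying a different validation to each data partition can be illustrated with a minimal, library-free sketch. It is not taken from the talk and does not use Pandera, Great Expectations, Fugue, Spark, or Dask; the `state` and `price` columns, the `PRICE_RULES` table, and all function names are hypothetical, chosen only to show the partition-then-validate pattern.

```python
from collections import defaultdict

# Hypothetical per-partition rules: each partition key maps to its own
# allowed (inclusive) price range. These values are illustrative only.
PRICE_RULES = {
    "FL": (150, 250),
    "CA": (200, 400),
}


def partition_rows(rows, key):
    """Group rows by the value of `key`, mimicking a partition-by step."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def validate_partition(name, rows):
    """Return the rows in one partition that violate that partition's rule."""
    low, high = PRICE_RULES[name]
    return [r for r in rows if not (low <= r["price"] <= high)]


def validate_all(rows, key="state"):
    """Apply each partition's own rule and collect failures per partition."""
    failures = {}
    for name, part in partition_rows(rows, key).items():
        bad = validate_partition(name, part)
        if bad:
            failures[name] = bad
    return failures


rows = [
    {"state": "FL", "price": 200},
    {"state": "FL", "price": 300},  # outside the FL range
    {"state": "CA", "price": 250},
    {"state": "CA", "price": 390},
]
failures = validate_all(rows)
print(failures)
```

In a distributed setting each partition would be handled by a separate worker rather than by a loop on one machine, which is the step the talk attributes to combining frameworks.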

Syllabus

TALK / Kevin Kho / Large Scale Data Validation (with Spark and Dask)


Taught by

PyCon US

Related Courses

Cloud Computing Concepts, Part 1
University of Illinois at Urbana-Champaign via Coursera
Cloud Computing Concepts: Part 2
University of Illinois at Urbana-Champaign via Coursera
Reliable Distributed Algorithms - Part 1
KTH Royal Institute of Technology via edX
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera
Réalisez des calculs distribués sur des données massives (Perform Distributed Computations on Massive Data)
CentraleSupélec via OpenClassrooms