
Fully Utilizing Spark for Data Validation with Fugue and Pandera

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Big Data Courses, pandas Courses, Data Validation Courses, Data Pipelines Courses

Course Description

Overview

This 22-minute Databricks conference talk covers data validation techniques for large-scale data pipelines. It explains why validation matters when data pipelines are interconnected and compares popular frameworks like Great Expectations with lightweight alternatives. It then shows how Fugue, an open-source framework, extends pandas-based validation libraries to Spark workflows. The talk also addresses a common deficiency in current frameworks: applying different validation rules to each partition of a large dataset. An interactive demo combines Fugue and Pandera into a flexible and efficient data validation solution for Spark. The talk closes with the trade-offs between robust features and performance, so you can tailor your validation approach to your specific needs.

Syllabus

Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera - Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition


Taught by

Databricks
