Fully Utilizing Spark for Data Validation with Fugue and Pandera
Offered By: Databricks via YouTube
Course Description
Overview
Explore data validation techniques for large-scale data pipelines in this 22-minute Databricks conference talk. Learn about the importance of data validation in interconnected data pipelines and compare popular frameworks like Great Expectations with lightweight alternatives. Discover how to extend Pandas-based validation libraries to Spark workflows using Fugue, an open-source framework. Gain insights into applying different validation rules for each partition in big data scenarios, addressing a common deficiency in current frameworks. Follow along with an interactive demo that combines Fugue and Pandera to create a flexible and efficient data validation solution for Spark. Understand the trade-offs between robust features and performance, and learn how to tailor your validation approach to your specific needs.
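The idea of applying a different validation rule to each partition can be sketched without any framework. The example below is only an illustration under stated assumptions: the locations, price ranges, column names, and function names are all hypothetical, and it uses plain Python rather than the Pandera checks and Fugue partitioning shown in the talk.

```python
from collections import defaultdict

# Hypothetical per-partition rules: allowed (min, max) price for each location.
PRICE_RANGES = {"FL": (7.0, 13.0), "CA": (15.0, 21.0)}

# Hypothetical sample rows; the column names are illustrative only.
ROWS = [
    {"location": "FL", "item": "burger", "price": 10.0},
    {"location": "FL", "item": "salad", "price": 12.5},
    {"location": "CA", "item": "burger", "price": 18.0},
    {"location": "CA", "item": "salad", "price": 25.0},  # outside the CA range
]


def partition_by(rows, key):
    """Group rows into one list per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def validate_partition(location, rows):
    """Return the rows whose price falls outside the range for this location."""
    low, high = PRICE_RANGES[location]
    return [row for row in rows if not (low <= row["price"] <= high)]


def validate_all(rows):
    """Apply each partition's own rule and collect failures per partition."""
    failures = {}
    for location, group in partition_by(rows, "location").items():
        bad = validate_partition(location, group)
        if bad:
            failures[location] = bad
    return failures


failures = validate_all(ROWS)
print(failures)
```

In a Spark workflow, the grouping step would presumably be handled by the distributed engine, so that each partition is validated independently with its own rule.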
Syllabus
Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera - Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition
Taught by
Databricks
Related Courses
Rails with Active Record and Action Pack
Johns Hopkins University via Coursera
Excel Skills for Business: Intermediate II
Macquarie University via Coursera
Programming 103: Saving and Structuring Data
Raspberry Pi Foundation via FutureLearn
Everyday Excel, Part 1
University of Colorado Boulder via Coursera
Creating Dashboards in Google Spreadsheets
Coursera Project Network via Coursera