Fully Utilizing Spark for Data Validation with Fugue and Pandera
Offered By: Databricks via YouTube
Course Description
Overview
Explore data validation techniques for large-scale data pipelines in this 22-minute Databricks conference talk. Learn about the importance of data validation in interconnected data pipelines and compare popular frameworks like Great Expectations with lightweight alternatives. Discover how to extend Pandas-based validation libraries to Spark workflows using Fugue, an open-source framework. Gain insights into applying different validation rules for each partition in big data scenarios, addressing a common deficiency in current frameworks. Follow along with an interactive demo that combines Fugue and Pandera to create a flexible and efficient data validation solution for Spark. Understand the trade-offs between robust features and performance, and learn how to tailor your validation approach to your specific needs.
Syllabus
Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera - Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition
Taught by
Databricks
Related Courses
Computational Investing, Part I
Georgia Institute of Technology via Coursera
Introduction to Machine Learning
Higher School of Economics via Coursera
Mathematics and Python for Data Analysis
Moscow Institute of Physics and Technology via Coursera
Introduction to Python for Data Science
Microsoft via edX
Python for Data Science
University of California, San Diego via edX