YoVDO

Fully Utilizing Spark for Data Validation with Fugue and Pandera

Offered By: Databricks via YouTube

Tags

Apache Spark Courses Big Data Courses pandas Courses Data Validation Courses Data Pipelines Courses

Course Description

Overview

Explore data validation techniques for large-scale data pipelines in this 22-minute Databricks conference talk. Learn about the importance of data validation in interconnected data pipelines and compare popular frameworks like Great Expectations with lightweight alternatives. Discover how to extend Pandas-based validation libraries to Spark workflows using Fugue, an open-source framework. Gain insights into applying different validation rules for each partition in big data scenarios, addressing a common deficiency in current frameworks. Follow along with an interactive demo that combines Fugue and Pandera to create a flexible and efficient data validation solution for Spark. Understand the trade-offs between robust features and performance, and learn how to tailor your validation approach to your specific needs.

Syllabus

Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera-Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition


Taught by

Databricks

Related Courses

Computational Investing, Part I
Georgia Institute of Technology via Coursera
Введение в машинное обучение
Higher School of Economics via Coursera
Математика и Python для анализа данных
Moscow Institute of Physics and Technology via Coursera
Introduction to Python for Data Science
Microsoft via edX
Python for Data Science
University of California, San Diego via edX