Building Data Quality Pipelines with Apache Spark and Delta Lake

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Data Validation Courses, Data Pipelines Courses, Delta Lake Courses

Course Description

Overview

Explore a fast-paced 27-minute video presentation by Databricks Technical Leads and Champions Darren Fuller and Sandy May on productionizing Data Quality Pipelines for enterprise customers. Learn about their vision to empower business decisions about data remediation and to enable self-healing Data Pipelines through a library of Data Quality rule templates, a reporting Data Model, and PowerBI reports. Discover how the Lakehouse pattern emphasizes Data Quality at the Lake layer, using tools like Delta Lake for schema protection and column checking. Watch quick-fire demos showing how Apache Spark can be used to apply rules over data at Staging or Curation points. Gain insights into simple and complex rule applications, including net sales calculations, value validations, statistical distribution validations, and complex pattern matching. Get a glimpse of future work on Data Compliance for PII data, involving rule generation using regex patterns and Machine Learning-based transfer learning.
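The rule-template idea described above can be sketched in miniature: each rule is a named predicate applied per record, and the results feed a small pass/fail report. This is a simplified, framework-agnostic Python sketch, not the presenters' library; in the talk these checks run as Spark expressions over Delta tables, and all names here (rule names, fields, helper functions) are illustrative assumptions.

```python
import re
from statistics import mean, stdev

# A rule library in miniature: each template is a name plus a predicate
# that returns True when a record passes the check.
RULES = {
    # Value validation: net sales must be non-negative.
    "net_sales_non_negative": lambda rec: rec["net_sales"] >= 0,
    # Complex pattern matching: fail records whose notes contain an email
    # address (a stand-in for regex-based PII detection).
    "no_email_in_notes": lambda rec: not re.search(
        r"[\w.+-]+@[\w-]+\.[\w.]+", rec["notes"]
    ),
}

def apply_rules(records, rules=RULES):
    """Apply every rule to every record; return per-rule pass/fail counts,
    a tiny stand-in for the reporting Data Model."""
    report = {name: {"passed": 0, "failed": 0} for name in rules}
    for rec in records:
        for name, predicate in rules.items():
            key = "passed" if predicate(rec) else "failed"
            report[name][key] += 1
    return report

def distribution_outliers(values, z=3.0):
    """Statistical distribution validation: indices of values lying more
    than z sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if sigma and abs(v - mu) > z * sigma]

if __name__ == "__main__":
    data = [
        {"net_sales": 120.0, "notes": "repeat customer"},
        {"net_sales": -5.0, "notes": "refund issued"},
        {"net_sales": 80.0, "notes": "contact: jane@example.com"},
    ]
    print(apply_rules(data))
```

In a production pipeline the same predicates would be expressed as Spark column expressions (or Delta Lake `CHECK` constraints) so that they run distributed at the Staging or Curation point rather than row-by-row in the driver.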

Syllabus

Intro
Problem Statement
Dirty Data
Build or Buy
Design Decisions
Microsoft Enterprise Data Warehouse
Demo
Summary


Taught by

Databricks

Related Courses

Distributed Computing with Spark SQL
University of California, Davis via Coursera
Apache Spark (TM) SQL for Data Analysts
Databricks via Coursera
Building Your First ETL Pipeline Using Azure Databricks
Pluralsight
Implement a data lakehouse analytics solution with Azure Databricks
Microsoft via Microsoft Learn
Perform data science with Azure Databricks
Microsoft via Microsoft Learn