Building Data Quality Pipelines with Apache Spark and Delta Lake
Offered By: Databricks via YouTube
Course Description
Overview
Explore a fast-paced 27-minute video presentation by Databricks Technical Leads and Champions Darren Fuller and Sandy May on productionizing data quality pipelines for enterprise customers. Learn about their vision of informing business decisions on data remediation and enabling self-healing data pipelines through a library of data quality rule templates, a reporting data model, and Power BI reports. Discover how the Lakehouse pattern emphasizes data quality at the lake layer, using tools such as Delta Lake for schema protection and column checking. Watch quick-fire demos showing how Apache Spark can apply rules over data at staging or curation points. Gain insight into both simple and complex rules, including net sales calculations, value validations, statistical distribution checks, and complex pattern matching. Get a glimpse of future work on data compliance for PII data, involving rule generation with regex patterns and machine-learning-based transfer learning.
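To make the rule-application idea concrete, below is a minimal PySpark sketch of the kind of checks the talk describes: a net sales recalculation, a value validation, and a regex pattern rule, applied at a staging point before writing to a curated Delta table. This is not the presenters' library; it assumes a Spark session with Delta Lake configured, and all paths and column names (sales_staging, net_sales, quantity, unit_price, customer_note) are hypothetical.

# Minimal data-quality rule sketch in PySpark, under the assumptions above.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-rules-sketch").getOrCreate()

# Read staged data from a (hypothetical) Delta table.
staged = spark.read.format("delta").load("/mnt/staging/sales_staging")

checked = (
    staged
    # Simple rule: net sales recomputed from quantity and unit price must
    # agree with the reported net_sales column, within a small tolerance.
    .withColumn(
        "net_sales_ok",
        F.abs(F.col("net_sales") - F.col("quantity") * F.col("unit_price")) < 0.01,
    )
    # Value-validation rule: quantity must be positive.
    .withColumn("quantity_ok", F.col("quantity") > 0)
    # Complex pattern rule: flag free text that looks like an email address,
    # a stand-in for the regex-based PII detection mentioned above.
    .withColumn("maybe_pii", F.col("customer_note").rlike(r"[\w.+-]+@[\w-]+\.[\w.]+"))
)

passed = F.col("net_sales_ok") & F.col("quantity_ok") & ~F.col("maybe_pii")

# Rows that pass flow on to curation; failures are kept for remediation.
clean = checked.filter(passed)
flagged = checked.filter(~passed)

# Delta Lake enforces the target table's schema on write, rejecting rows
# whose columns do not match (the "schema protection" noted above).
clean.write.format("delta").mode("append").save("/mnt/curated/sales")
flagged.write.format("delta").mode("append").save("/mnt/dq/flagged_sales")

Persisting the flagged rows to their own table is one way such a pipeline could feed the remediation reporting layer the presenters describe, with each rule's boolean column driving per-rule failure counts.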
Syllabus
Intro
Problem Statement
Dirty Data
Build or Buy
Design Decisions
Microsoft Enterprise Data Warehouse
Demo
Summary
Taught by
Databricks
Related Courses
Distributed Computing with Spark SQL - University of California, Davis via Coursera
Apache Spark (TM) SQL for Data Analysts - Databricks via Coursera
Building Your First ETL Pipeline Using Azure Databricks - Pluralsight
Implement a data lakehouse analytics solution with Azure Databricks - Microsoft via Microsoft Learn
Perform data science with Azure Databricks - Microsoft via Microsoft Learn