YoVDO

Architecting for Data Quality in the Lakehouse with Delta Lake and PySpark

Offered By: Databricks via YouTube

Tags

PySpark Courses, Apache Spark Courses, Delta Lake Courses, Storage Optimization Courses

Course Description

Overview

This tech talk covers how to architect for data quality in the lakehouse with Delta Lake and PySpark. It shows how DevOps and software engineering best practices can reduce data downtime, and presents techniques for identifying, resolving, and preventing data issues across the lakehouse. The session explains how to optimize data reliability across the metadata, storage, and query engine tiers, walks through building data observability monitors in PySpark, and describes the role of tools such as Delta Lake in scaling this design. Topics include the Data Quality Cone of Anxiety, data observability principles, the Data Reliability Lifecycle, and the differences between data lakes and warehouses, with practical examples of measuring update times, loading data, and feature engineering. Accompanying exercises and Jupyter notebooks are provided for hands-on practice.
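
As a rough illustration of the "measuring update times" idea mentioned above, the sketch below flags unusually long delays between table updates. It is not taken from the talk or its notebooks: it uses plain Python in place of PySpark so that it runs without a cluster, and the function name find_freshness_gaps, the sample timestamps, and the two-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def find_freshness_gaps(update_times, max_delay):
    """Return (previous, current, delay) for every gap longer than max_delay."""
    ordered = sorted(update_times)
    gaps = []
    # Compare each update with the one immediately before it.
    for previous, current in zip(ordered, ordered[1:]):
        delay = current - previous
        if delay > max_delay:
            gaps.append((previous, current, delay))
    return gaps


# Hypothetical update log for a table that normally refreshes once a day.
updates = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 2),
    datetime(2024, 1, 3),
    datetime(2024, 1, 7),  # four-day gap: a possible freshness incident
    datetime(2024, 1, 8),
]

# Assumed threshold: anything longer than two days counts as stale.
incidents = find_freshness_gaps(updates, max_delay=timedelta(days=2))
for previous, current, delay in incidents:
    print(f"No update between {previous:%Y-%m-%d} and {current:%Y-%m-%d} ({delay.days} days)")
```

In a lakehouse setting the same comparison would typically run over a table's update timestamps rather than an in-memory list, and the threshold would be tuned to the table's expected update cadence.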

Syllabus

Intro
Welcome
Introductions
Agenda
Data Quality Cone of Anxiety
How do we address bad data?
What is data observability?
Freshness
Distribution
Volume
Schema
Data Lineage
Data Reliability Lifecycle
Lake vs Warehouse
Metadata
Storage
Query logs
Query engine
Questions
Describe Detail
Architecture for observability
Measuring update times
Loading data in CSV or JSON
Update cadence
Feature engineering
Lambda function
Delay between updates
Model Parameters
Training Labels
Questions and Answers
Summary
Upcoming events
Data Quality Fundamentals
Monte Carlo


Taught by

Databricks

Related Courses

Big Data Essentials, offered by A Cloud Guru
Big Data, offered by University of Adelaide via edX
Advanced Data Science with IBM, offered by IBM via Coursera
Amazon EMR Getting Started (Indonesian), offered by Amazon Web Services via AWS Skill Builder
Lab - Analyze and Prepare Data with Amazon SageMaker Data Wrangler and Amazon EMR (Portuguese (Brazil)), offered by Amazon Web Services via AWS Skill Builder