Unsupervised Machine Learning for Scaling Data Quality Monitoring in Databricks
Offered By: Databricks via YouTube
Course Description
Overview
This 37-minute conference talk shows how unsupervised machine learning can scale data quality monitoring in Databricks. It covers the limitations of traditional rules-and-metrics approaches and introduces a set of fully unsupervised machine learning algorithms designed to monitor data quality at scale, including how they work, their strengths and weaknesses, and how they are tested and calibrated. A worked example uses ticket sales data and walks through setting up monitoring in Anomalo, followed by visualizations of severity, explanation, distribution, and root cause analysis. The talk then covers encoding features automatically, building a supervised model, and generating visualizations using SHAP values, and closes with implementation and testing challenges and how to get started with these techniques in Databricks.
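To make the "supervised model" step less abstract, here is a minimal sketch of one common way to turn unlabeled monitoring into a classification problem. It is an assumption about the general idea, not the talk's or Anomalo's actual algorithm, which this listing does not specify. Rows from a reference period are treated as class 0 and rows from the current period as class 1. If even a one-feature "classifier" that ranks rows by their raw value separates the two classes, the data has shifted. The function names and the ticket prices are hypothetical.

```python
from typing import Sequence


def separation_auc(reference: Sequence[float], current: Sequence[float]) -> float:
    """AUC of a one-feature 'classifier' that scores each row by its raw value.

    Equals the probability that a random current value exceeds a random
    reference value, with ties counted as one half.
    """
    if len(reference) == 0 or len(current) == 0:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for c in current:
        for r in reference:
            if c > r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(reference) * len(current))


def drift_score(reference: Sequence[float], current: Sequence[float]) -> float:
    """Rescale AUC to [0, 1]: 0 = indistinguishable, 1 = fully separable."""
    return abs(separation_auc(reference, current) - 0.5) * 2.0


# Hypothetical ticket prices: yesterday's rows vs. today's rows.
yesterday = [20.0, 25.0, 30.0, 35.0]
same_today = [20.0, 25.0, 30.0, 35.0]
shifted_today = [120.0, 125.0, 130.0, 135.0]

print(drift_score(yesterday, same_today))     # 0.0 (identical samples)
print(drift_score(yesterday, shifted_today))  # 1.0 (fully separated samples)
```

A production system would presumably train a multi-feature model and attribute its predictions to columns, for example with SHAP values as the description mentions. This sketch only illustrates the labeling trick on a single numeric column.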
Syllabus
Intro
Data Quality in the Modern Data Stack
Three Approaches to Data Quality Monitoring
Ticket Sales Data
Setup Monitoring in Anomalo
Anomalo Monitoring
Chaos Library
Check Log
Visualizations: Severity & Explanation
Visualizations: Distribution
Visualizations: Root Cause Analysis
Encode Features Automatically
Build a Supervised Model
Generate Visualizations Using SHAP Values
Challenges
Testing
Get Started in Databricks
DATA+AI SUMMIT 2022
Taught by
Databricks
Related Courses
Intro to Statistics
Stanford University via Udacity
Introduction to Data Science
University of Washington via Coursera
Passion Driven Statistics
Wesleyan University via Coursera
Information Visualization
Indiana University via Independent
DCO042 - Python For Informatics
University of Michigan via Independent