Data Engineering with Databricks
Offered By: Pragmatic AI Labs via edX
Course Description
Overview
Master Data Engineering on the Databricks Lakehouse Platform
- Learn Databricks architecture, cluster management & notebook analysis
- Build reliable ETL pipelines with Delta Lake for data transformation
- Implement advanced data processing techniques with Apache Spark
Course Highlights:
- Create & scale Databricks clusters for workloads
- Load data from diverse sources into notebooks
- Explore, visualize & profile datasets with notebooks
- Version control & share notebooks via Git integration
- Read & ingest data in various file formats
- Transform data with SQL & DataFrame operations
- Handle complex data types like arrays, structs, timestamps
- Deduplicate, join & flatten nested data structures
- Identify & fix data quality issues with UDFs
- Load cleansed data into Delta Lake for reliability
- Build production-ready pipelines with Delta Live Tables
- Schedule & monitor workloads using Databricks Jobs
- Secure data access with Unity Catalog
Gain comprehensive skills in data engineering on Databricks through hands-on labs, real-world projects, and best practices for the modern data lakehouse.
Syllabus
Module 1: Databricks Lakehouse Platform Fundamentals
Introduction to the Databricks Lakehouse Platform and its architecture
Creating, managing, and configuring clusters
Setting up and using Databricks with IntelliJ, RStudio, and the Databricks CLI
Introduction to notebooks, including execution, sharing, and multi-language support
Efficient data transformation with Spark SQL and the Catalog Explorer
Creating tables from files and querying external data sources
Reliable data pipelines with Delta Lake, ACID transactions, and Z-Ordering optimization
Module 2: Data Transformation and Pipelines
Automated pipelines with Delta Live Tables
- Delta Live Tables components
- Continuous vs triggered pipelines
- Configuring Auto Loader
- Querying pipeline events
- End-to-end example of Delta Live Tables
- Vacuum and garbage collection
Orchestrating workloads with Databricks Jobs
- Multi-task workflows and task dependencies
- Viewing job history
- Using dashboards
- Handling failures and configuring retries
Unified data access with Unity Catalog
- Catalogs vs metastores
- Unity Catalog quickstart in Python
- Applying object security
- Best practices for catalogs, connections, and business units
Taught by
Noah Gift and Alfredo Deza
Related Courses
- Big Data Essentials (A Cloud Guru)
- Big Data (University of Adelaide via edX)
- Advanced Data Science with IBM (IBM via Coursera)
- Amazon EMR Getting Started (Indonesian) (Amazon Web Services via AWS Skill Builder)
- Analisar e preparar dados com o Amazon SageMaker Data Wrangler e o Amazon EMR (Português (Brasil)) | Lab - Analyze and Prepare Data with Amazon SageMaker Data Wrangler and Amazon EMR (Portuguese (Brazil)) (Amazon Web Services via AWS Skill Builder)