YoVDO

Best Practices for Building Robust Data Platforms with Apache Spark and Delta

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Big Data Courses, Data Pipelines Courses, Delta Lake Courses

Course Description

Overview

Discover best practices for building robust data platforms with Apache Spark and Delta in this 27-minute talk from Databricks. Drawing on real-world experience, the talk shows how to overcome common technical challenges and build performant, scalable pipelines. It covers operational tips for running Apache Spark in production, sound data pipeline design, and common misconfigurations to avoid, along with strategies for optimizing costs, achieving performance at scale, and complying with GDPR and CCPA. Topics include cluster sizing, instance type selection, and workload tuning using the Spark UI and Ganglia metrics, as well as the benefits of Adaptive Query Execution and data governance with Delta Lake. Suitable for attendees with some experience setting up big data pipelines and Apache Spark.

Syllabus

Intro
Data Challenges
Usual Data Lake
Getting the Data Right
Best Practices for Cluster Sizing & Selection
Selection of Instance Types
Selection of Node Size: Rule of Thumb
Observe Spark UI & tweak the workloads
Observe Ganglia Metrics & tweak the workloads
Performance Symptoms
Adaptive Query Execution
Data Governance with Delta Lake
Audit & Monitoring


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering
University of California, Berkeley via edX
Big Data Analytics
University of Adelaide via edX
Big Data Essentials: HDFS, MapReduce and Spark RDD
Yandex via Coursera
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
Yandex via Coursera
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera