Optimizing Catalyst Optimizer for Complex Spark Plans

Offered By: Databricks via YouTube

Tags

Apache Spark Courses
Performance Tuning Courses
Big Data Analytics Courses
Data Validation Courses

Course Description

Overview

This 27-minute conference talk from DAIS NA 2021 covers optimization techniques for complex Apache Spark plans. It draws on Workday's experience building analytics products with Spark, where the challenges include compiling large-scale DataFrames and handling large case expressions. Topics include memory-efficient plan logging, common subexpression elimination (CSE) to remove redundant subplans, and a rewrite of Spark's constraint propagation mechanism (SPARK-33152). The talk presents benchmarks for these changes, shows their effect on Catalyst performance in a production customer pipeline, and closes with tuning tips for complex Spark plans and planned future work.
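
To give a rough feel for common subexpression elimination, here is a minimal Python sketch on a toy plan tree. It is not Catalyst code and not the implementation presented in the talk. The tuple-based node shape, the function names (`subtrees`, `common_subplans`, `eliminate`) and the `cse_N` reference names are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Toy plan nodes (an assumption for this sketch, not Spark's plan classes):
# a node is a tuple (operator, arg_or_child, ...); leaves are plain strings.

def subtrees(node):
    """Yield every tuple-shaped subtree of a toy plan, including the root."""
    if isinstance(node, tuple):
        yield node
        for child in node[1:]:
            yield from subtrees(child)

def common_subplans(plan):
    """Return the set of subtrees that occur more than once in the plan."""
    counts = Counter(subtrees(plan))
    return {sub for sub, n in counts.items() if n > 1}

def eliminate(plan, shared, bindings):
    """Replace each repeated subtree with a reference name.

    `bindings` maps the (rewritten) shared subtree to its reference name,
    so each repeated subplan is defined exactly once.
    """
    if not isinstance(plan, tuple):
        return plan
    rewritten = (plan[0],) + tuple(eliminate(c, shared, bindings) for c in plan[1:])
    if plan in shared:
        return bindings.setdefault(rewritten, f"cse_{len(bindings)}")
    return rewritten

# The same filter feeds both branches of a union.
filtered = ("filter", "amount > 0", "events")
plan = ("union", ("project", "a", filtered), ("project", "b", filtered))

shared = common_subplans(plan)
bindings = {}
deduplicated = eliminate(plan, shared, bindings)
```

In this toy example the repeated filter is defined once in `bindings` and both projections refer to it by name, so downstream steps would only need to process that subplan a single time.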

Syllabus

Intro
Spark in Workday Prism Analytics
Example: Data Validation
About Complex Plans
Common Subexpression Elimination (CSE)
CSE Benchmark
Logging Complex Plans (10s of MBs in Size)
Problems with Large Case Expressions
Handling Large Case Expressions in Catalyst
Large Case Expression Benchmark
Example: Generate New Filter
Example: Prune Redundant Filter
Example: New Filter on Other Side of Join
Current Constraint Propagation Algorithm
Current Algorithm Takes High Memory
Recall: Fix for Large Case Expressions
Optimized Constraint Propagation (SPARK-33152)
Constraint Propagation Algorithms Comparison
Constraint Propagation Benchmark
Effect on Customer Pipeline
Tuning Tips
Future Work


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering
University of California, Berkeley via edX
Big Data Analytics
University of Adelaide via edX
Big Data Essentials: HDFS, MapReduce and Spark RDD
Yandex via Coursera
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
Yandex via Coursera
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera