YoVDO

Accelerating Data Processing in Spark SQL with Pandas UDFs - Optimization Techniques

Offered By: Databricks via YouTube

Tags

Spark SQL Courses, Big Data Courses, Python Courses, Apache Spark Courses, Performance Tuning Courses, Data Analytics Courses, Batch Processing Courses

Course Description

Overview

Discover optimization techniques for Spark SQL data processing using Pandas UDFs in this 27-minute video from Databricks. Learn how to accelerate query performance by over an order of magnitude through specialized batch processing jobs. Explore what Spark SQL excels at and where it falls short, and gain insights into implementing custom UDFs for significant performance gains. Understand how to profile Spark SQL jobs efficiently to validate optimization strategies. Follow along as the speaker shares experiences from developing a model training pipeline at Quantcast, processing petabytes of data for thousands of models. Dive into practical examples, including naive approaches and various optimization techniques such as looping with Pandas UDFs, aggregating keys in batches, using inverted indexes, and leveraging Python libraries. Equip yourself with valuable knowledge to enhance your data processing workflows in Spark SQL.
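To give a feel for one of the techniques named above, here is a minimal sketch of the general inverted-index idea in plain Python. It is not taken from the talk: the model names, keys, and helper functions (`MODEL_KEYS`, `build_inverted_index`, `match_naive`, `match_indexed`) are hypothetical, and the listing does not describe the data the speaker actually uses. The assumption is that each model cares about a set of keys and each input row carries one key, so a key-to-models lookup can replace a scan over every model.

```python
from collections import defaultdict

# Hypothetical example data: each model is interested in a set of keys.
MODEL_KEYS = {
    "model_a": {"k1", "k2"},
    "model_b": {"k2", "k3"},
    "model_c": {"k4"},
}


def build_inverted_index(model_keys):
    """Map each key to the sorted list of models that reference it."""
    index = defaultdict(list)
    for model, keys in model_keys.items():
        for key in keys:
            index[key].append(model)
    return {key: sorted(models) for key, models in index.items()}


def match_naive(row_keys, model_keys):
    """Baseline: check every model for every row (rows x models work)."""
    return [
        sorted(model for model, keys in model_keys.items() if key in keys)
        for key in row_keys
    ]


def match_indexed(row_keys, index):
    """One dictionary lookup per row instead of a scan over all models."""
    return [index.get(key, []) for key in row_keys]


INDEX = build_inverted_index(MODEL_KEYS)
ROWS = ["k2", "k4", "k9"]  # "k9" is deliberately unknown to every model
```

Both functions return the same matches for `ROWS`; the indexed version just avoids touching models that cannot match. Per-batch logic of this kind is the sort of thing that could plausibly run inside a Pandas UDF, though how the talk wires it into Spark is not stated in this listing.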

Syllabus

Intro
Optimization Tricks
What are Pandas UDFs?
Development tips and tricks
Modeling at Quantcast
Example Problem
Naive approach: Use Spark SQL
Optimization: Use Pandas UDFs for Looping
Optimization: Aggregate Keys in Batches
Optimization: Inverted Indexes
Optimization: Use Python Libraries
Optimization: Summary


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering
University of California, Berkeley via edX
Big Data Analytics
University of Adelaide via edX
Big Data Essentials: HDFS, MapReduce and Spark RDD
Yandex via Coursera
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
Yandex via Coursera
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera