How Adobe Processes 2 Million Records Per Second Using Apache Spark
Offered By: Databricks via YouTube
Course Description
Overview
Explore how Adobe processes 2 million records per second using Apache Spark in this 41-minute Databricks conference talk. Dive into the challenges and solutions of Adobe's Unified Profile System, which ingests terabytes of data daily. Learn about optimizing repeated queries, understanding join operations, monitoring structured streaming lag, handling data skew, effective sampling techniques, and leveraging Redis for enhanced performance. Gain valuable insights from Adobe's experiences in scaling their Apache Spark deployment, including practical tips on caching physical plans, managing shuffles, dealing with backpressure, and making code resilient to skewed datasets. Benefit from real-world war stories and lessons that can be applied to large-scale data processing challenges in your own projects.
Syllabus
Intro
What do you mean by Processing? Agenda!
Unified Profile Data Ingestion
Generic Flow
Flow with MinPartitions partitions on Kafka
MicroBatch Hard! Logic Best Practices
An Example
For Repeated Queries Over Same DF
Join Optimization For Interactive Queries (Opinionated)
How to get the magic targetPartitionCount?
Digging into Redis Pipelining + Spark
Taught by
Databricks
Related Courses
CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
Big Data Analytics (University of Adelaide via edX)
Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)