Apache Spark Core - Practical Optimization Techniques - Partition Shaping and Job Optimization

Offered By: Databricks via YouTube

Tags

Apache Spark Courses
Cluster Computing Courses

Course Description

Overview

This conference talk from Databricks covers practical optimization techniques for Apache Spark Core. It shows how to shape partitions and jobs so that Spark's optimizations can take effect, skew is eliminated, and the cluster is fully utilized. The talk walks through Spark partition shaping methods and several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism. It also covers the software hierarchy, hardware considerations, and practical demonstrations, along with techniques such as lazy loading, data skipping, and shuffle partition management. Further topics include input and output partitions, workload balancing, persistence strategies, the DBIO Cache, broadcast joins, and skew joins. The talk runs 1 hour and 32 minutes.
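
Of the techniques named in the overview, salting lends itself to a small illustration. The following is a minimal sketch in plain Python rather than Spark, and it is not taken from the talk. The dataset, the function names, the CRC32-based partitioner, and the round-robin salt are all assumptions chosen to keep the example deterministic. It shows how a single hot key that would otherwise fill one partition is spread across several once a salt suffix is appended.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Return the record count per partition under a stable hash partitioner.

    zlib.crc32 stands in for the engine's hash function (an assumption made
    so that results do not depend on Python's per-process string hashing).
    """
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, salt_buckets):
    """Append a salt suffix so each original key maps to up to salt_buckets keys.

    A round-robin salt keeps this sketch deterministic; a random salt is the
    more common choice in practice.
    """
    return [f"{k}#{i % salt_buckets}" for i, k in enumerate(keys)]


def make_skewed_keys():
    """Hypothetical dataset: one hot key with 10,000 rows, ten keys with 10 rows each."""
    keys = ["hot"] * 10_000
    for n in range(10):
        keys += [f"k{n}"] * 10
    return keys


if __name__ == "__main__":
    keys = make_skewed_keys()
    print("unsalted:", partition_sizes(keys, 8))
    print("salted:  ", partition_sizes(salt_keys(keys, 16), 8))
```

In a real job the salt has to be undone afterwards: a salted aggregation typically needs a second aggregation on the original key, and a salted join needs the other side replicated across every salt value.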

Syllabus

Introduction
About Daniel
Agenda
Software Hierarchy
Demo
Hardware
Baseline
CPU Utilization
Ganglia Reports
Lazy Loading
Code
Data Skipping
Optimizations
Output
Shuffle Partitions
Workload
Shuffle Partition Example
Shuffle Partition Summary
Input Partition Summary
What Does This Do
Output Partitions
Workload Example
Partitions
Balance
Persistence
DBIO Cache
Join Optimization
Broadcast Join
Skew Joins
Group Bys
The Beast


Taught by

Databricks

Related Courses

Managing Big Data in Clusters and Cloud Storage
Cloudera via Coursera
The Complete Apache Kafka Practical Guide
Udemy
Dynamical Systems in Neuroscience
MITCBMM via YouTube
Dimensionality Reduction II
MITCBMM via YouTube
Optimizing Spark SQL Jobs with Parallel and Asynchronous IO
Databricks via YouTube