YoVDO

Beyond Shuffling - Scaling Apache Spark

Offered By: Scala Days Conferences via YouTube

Tags

Scala Days Courses, Apache Spark Courses, Cluster Management Courses

Course Description

Overview

Explore advanced techniques for scaling Apache Spark in this 43-minute conference talk from Scala Days Berlin 2016. Delve into best practices and code snippets for handling large datasets efficiently. Learn to leverage Spark counters for performance investigation, optimize key-value data operations, and replace groupByKey with memory-efficient alternatives. Discover effective caching and checkpointing strategies to reduce execution time. Gain insights on functional transformations using Spark Datasets, working in noisy cluster environments, and utilizing Spark SQL for improved performance. Master the art of validating Spark jobs with accumulators and explore additional testing resources to enhance your Spark development skills.

Syllabus

Intro
What is going to be covered
The different pieces of Spark
What is key skew and why do we care?
Well, there is a bit of magic in the shuffle...
Iterator-to-Iterator transformations
Why is Spark SQL good for those things?
How much faster can it be?
How to avoid lineage explosions
Introducing Datasets
And functional style maps
Switching gears: Validating Spark jobs
Using an accumulator for validation
Validating records read matches our expectations
Additional Spark Testing Resources
Additional Spark Resources
Spark Videos


Taught by

Scala Days Conferences

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering
University of California, Berkeley via edX
Big Data Analytics
University of Adelaide via edX
Big Data Essentials: HDFS, MapReduce and Spark RDD
Yandex via Coursera
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames
Yandex via Coursera
Introduction to Apache Spark and AWS
University of London International Programmes via Coursera