Spark Fundamentals II
Offered By: IBM via Cognitive Class
Course Description
Overview
Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
- Gain in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s memory caching.
- Learn how to use the more advanced operations available in the API.
- The lab exercises for this course are performed entirely in the cloud, using a notebook interface.
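To make the idea of distributing data across partitions a little more concrete, here is a minimal pure-Python sketch. It is not Spark code and does not use the Spark API; the function names `hash_partition` and `sum_by_key` are hypothetical, chosen only for illustration. It assumes key-value records are assigned to partitions by hashing the key, so that all records sharing a key land in the same partition and each partition can be aggregated on its own.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def sum_by_key(partition):
    """Aggregate one partition locally; no other partition's data is needed."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Illustrative data: small integer keys, so the hash is deterministic.
records = [(1, 10), (2, 20), (3, 30), (1, 5), (2, 7)]
partitions = hash_partition(records, num_partitions=2)

# Each partition could be handled by a separate worker in a real cluster.
per_partition = [sum_by_key(p) for p in partitions]

# Keys never span partitions, so merging the partial results is a plain union.
result = {}
for totals in per_partition:
    result.update(totals)
```

Because every key lives in exactly one partition, the per-key sums need no further exchange of data between partitions. When data is not already partitioned by key, it has to be redistributed first, which is the kind of cost the course's material on shuffling addresses.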
Syllabus
- Module 1 - Introduction to Notebooks
  - Understand how to use Zeppelin in your Spark projects
  - Identify the various notebooks you can use with Spark
- Module 2 - Spark RDD Architecture
  - Understand how Spark generates RDDs
  - Manage partitions to improve RDD performance
- Module 3 - Optimizing Transformations and Actions
  - Use advanced Spark RDD operations
  - Identify which operations cause shuffling
- Module 4 - Caching and Serialization
  - Understand how and when to cache RDDs
  - Understand storage levels and their uses
- Module 5 - Developing and Testing
  - Understand how to use sbt to build Spark projects
  - Understand how to use Eclipse and IntelliJ for Spark development
Related Courses
- CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
- Big Data Analytics (University of Adelaide via edX)
- Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
- Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
- Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)