Spark Fundamentals II
Offered By: IBM via Cognitive Class
Course Description
Overview
Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
- Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s memory caching.
- Learn how to use the more advanced operations available in the API.
- The lab exercises for this course run exclusively in the cloud, using a notebook interface.
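As a rough illustration of the idea behind RDDs described above (data split into partitions, one independent task per partition, partial results combined at the end), here is a minimal plain-Python sketch. It is an analogy only, not Spark's API: the function names `split_into_partitions`, `sum_of_squares`, and `parallel_sum_of_squares` are hypothetical, and a local thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Per-partition task: needs no data from any other partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Run one task per partition, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    # A thread pool is an assumption standing in for cluster executors.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return reduce(lambda a, b: a + b, partial_results, 0)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), num_partitions=3))
```

The number of partitions bounds how many tasks can run at once, which is one reason partition management matters for performance.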
Syllabus
- Module 1 - Introduction to Notebooks
  - Understand how to use Zeppelin in your Spark projects
  - Identify the various notebooks you can use with Spark
- Module 2 - Spark RDD Architecture
  - Understand how Spark generates RDDs
  - Manage partitions to improve RDD performance
- Module 3 - Optimizing Transformations and Actions
  - Use advanced Spark RDD operations
  - Identify which operations cause shuffling
- Module 4 - Caching and Serialization
  - Understand how and when to cache RDDs
  - Understand storage levels and their uses
- Module 5 - Development and Testing
  - Understand how to use sbt to build Spark projects
  - Understand how to use Eclipse and IntelliJ for Spark development
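The partitioning and shuffling topics in Modules 2 and 3 can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not Spark code: `partition_id`, `partition_by_key`, and `join_copartitioned` are hypothetical helpers, and CRC32 is an arbitrary choice of deterministic hash. Redistributing records by key stands in for a shuffle; once two datasets share the same partitioner and partition count, a join can proceed one partition pair at a time with no further data movement.

```python
import zlib
from collections import defaultdict


def partition_id(key, num_partitions):
    """Deterministic hash partitioner (CRC32 is an illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions):
    """Move (key, value) pairs so that equal keys land in the same partition.

    This redistribution is the plain-Python analogue of a shuffle.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_id(key, num_partitions)].append((key, value))
    return partitions


def join_copartitioned(left_parts, right_parts):
    """Inner-join two datasets that were partitioned with the same function
    and the same partition count, one partition pair at a time."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a lookup over the right-hand partition only.
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        for key, left_value in left:
            for right_value in index.get(key, []):
                joined.append((key, (left_value, right_value)))
    return joined


if __name__ == "__main__":
    users = [("u1", "Ann"), ("u2", "Bob"), ("u3", "Cy")]
    orders = [("u1", 10), ("u3", 30), ("u1", 11)]
    user_parts = partition_by_key(users, 4)
    order_parts = partition_by_key(orders, 4)
    print(sorted(join_copartitioned(user_parts, order_parts)))
```

In this sketch the partitioned datasets are simply held in memory and reused; in Spark, keeping a partitioned dataset available for reuse is where caching and storage levels come in.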