Spark Fundamentals II
Offered By: IBM via Cognitive Class
Course Description
Overview
Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
- Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
- Learn how to optimize your data for joins using Spark’s memory caching.
- Learn how to use the more advanced operations available in the API.
- Lab exercises for this course are performed exclusively in the cloud, using a notebook interface.
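To make the idea of distributing data and parallelizing tasks concrete, here is a minimal sketch in plain Python. It does not use Spark or its API. The helper names `split_into_partitions`, `map_partitions`, and `collect` are hypothetical and only loosely mirror the concepts of partitioning, transformations, and actions. The round-robin split and the thread pool are simplifying assumptions, not how Spark actually assigns records or schedules tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin into num_partitions lists (an assumed scheme)."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def map_partitions(partitions, func):
    """Apply func to every record, handling each partition in its own worker thread."""
    def process(partition):
        return [func(record) for record in partition]

    # One worker per partition stands in for one task per partition on a cluster.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(process, partitions))


def collect(partitions):
    """Gather all partitions back into a single list, partition by partition."""
    return [record for partition in partitions for record in partition]


data = list(range(10))
parts = split_into_partitions(data, 3)            # three independent chunks
squared = map_partitions(parts, lambda x: x * x)  # per-partition work runs in parallel
result = collect(squared)                         # results are brought back together
```

Note that `result` comes back grouped by partition rather than in the original input order, which hints at why the way data is partitioned matters for the operations that follow.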
Syllabus
- Module 1 - Introduction to Notebooks
  - Understand how to use Zeppelin in your Spark projects
  - Identify the various notebooks you can use with Spark
- Module 2 - Spark RDD Architecture
  - Understand how Spark generates RDDs
  - Manage partitions to improve RDD performance
- Module 3 - Optimizing Transformations and Actions
  - Use advanced Spark RDD operations
  - Identify what operations cause shuffling
- Module 4 - Caching and Serialization
  - Understand how and when to cache RDDs
  - Understand storage levels and their uses
- Module 5 - Develop and Testing
  - Understand how to use sbt to build Spark projects
  - Understand how to use Eclipse and IntelliJ for Spark development
Related Courses
- Coding the Matrix: Linear Algebra through Computer Science Applications (Brown University via Coursera)
- How Machines Think: An Introduction to Computing Technologies (King Fahd University of Petroleum and Minerals via Rwaq (رواق))
- Data Science and Situational Analysis: Behind the Scenes of Big Data (IONIS via IONIS)
- Data Lakes for Big Data (EdCast)
- Statistics I: Fundamentals of Data Analysis (ga014) (University of Tokyo via gacco)