YoVDO

Complete PySpark Developer Course (Spark with Python)

Offered By: Udemy

Tags

PySpark Courses, Python Courses, Apache Spark Courses

Course Description

Overview

Learn PySpark in depth with hundreds of practical examples. Become a complete PySpark developer and set up a Hadoop cluster.

What you'll learn:
  • Complete Curriculum for a successful PySpark Developer
  • Hadoop Single Node Cluster Setup and Integration with Spark 2.x and Spark 3.x
  • Complete Flow of Installation of PySpark (Windows and Unix)
  • Detailed HDFS Course
  • Python Crash Course
  • Introduction to Spark
  • Understand SparkSession
  • Spark RDD Fundamentals, Operations, Persistence. Practical Examples to solve problems.
  • Spark Cluster Architecture - Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler
  • Spark Shared Variables
  • Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine
  • DataFrame Fundamentals
  • DataFrame Rows, Columns and DataTypes. Practical examples.
  • ETL Using DataFrame (Extraction APIs, Transformation APIs, and Loading APIs). Practical Examples.
  • Optimization and Management - Join Strategies, Driver Conf, Executor Conf, etc.

This is a complete PySpark Developer course for Data Engineers, Data Scientists, and anyone else who wants to process Big Data effectively. We will cover the topics below and more:

  • Complete Curriculum for a successful PySpark Developer

  • Set up Hadoop Single Node Cluster and Integrate it with Spark 2.x and Spark 3.x

  • Complete Flow of Installation of Standalone PySpark (Unix and Windows Operating System)

  • Detailed HDFS Commands and Architecture.

  • Python Crash Course

  • Introduction to Spark (Why Spark was Developed, Spark Features, Spark Components)

  • Understand SparkSession

  • Spark RDD Fundamentals

  • How to Create RDDs

  • RDD Operations (Transformations & Actions)

  • Spark Cluster Architecture - Execution, YARN, JVM Processes, DAG Scheduler, Task Scheduler

  • RDD Persistence

  • Spark Shared Variables - Broadcast

  • Spark Shared Variables - Accumulators

  • Spark SQL Architecture, Catalyst Optimizer, Volcano Iterator Model, Tungsten Execution Engine, Different Benchmarks

  • Difference between Catalyst Optimizer and Volcano Iterator Model

  • Commonly Used Spark Functions - version, range, createDataFrame, sql, table, sparkContext, conf, read, udf, newSession, stop, catalog, etc.

  • DataFrame Built-in functions - new column functions, encryption functions, string functions, regexp functions, date functions, null functions, collection functions, na functions, math and statistics functions, explode functions, flatten functions, formatting and json functions

  • What is Partition

  • What is Repartition

  • What is Coalesce

  • Repartition Vs Coalesce

  • Extraction - CSV file, text file, Parquet file, ORC file, JSON file, Avro file, Hive, JDBC

  • DataFrame Fundamentals

  • What is a DataFrame

  • DataFrame Sources

  • DataFrame Features

  • DataFrame Organization

  • DataFrame Rows

  • DataFrame Columns

  • DataTypes. Practical examples.

  • Perform ETL Using DataFrame

    -- Extraction APIs

    -- Transformation APIs

    -- Loading APIs

    -- Practical Examples.

  • Optimization and Management - Join Strategies, Driver Conf, Parallelism Configurations, Executor Conf, etc.



Taught by

Learn-Spark.info (Spark University)

Related Courses

  • CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
  • Big Data Analytics (University of Adelaide via edX)
  • Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
  • Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
  • Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)