Windowing and Join Operations on Streaming Data with Apache Spark on Databricks
Offered By: Pluralsight
Course Description
Overview
This course will teach you how to leverage windowing, watermarking, and join operations on streaming data in Spark for your specific use cases.
Structured Streaming in Apache Spark treats real-time data as a table that is constantly being appended to. In this stream processing model, the burden of stream processing shifts from the user to the system, making it easy and intuitive to process streaming data with Spark. Apache Spark supports a range of windowing and join operations on streaming data using processing time and event time. In this course, Windowing and Join Operations on Streaming Data with Apache Spark on Databricks, you will learn the difference between stateless operations, which operate on a single streaming entity, and stateful operations, which operate on multiple entities accumulated in a stream. Then, you will explore the different kinds of windows supported by Apache Spark, including tumbling windows, sliding windows, and global windows. Next, you will understand the differences between event time, ingestion time, and processing time, and see how you can perform windowing operations using both processing time and event time. Along the way, you will connect to an HDInsight Kafka cluster to read records for your input stream. You will then use watermarking to deal with late-arriving data and see how watermarks limit the state that Apache Spark stores. Finally, you will perform join operations using streams and explore the types of joins that Spark supports for static-stream joins and stream-stream joins. You will also see how you can connect to Azure Event Hubs to read records. When you are finished with this course, you will have the skills and knowledge of windowing and join operations needed to identify when these powerful transformations should be performed and how to perform them.
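As a rough illustration of the windowing and watermarking ideas named in the description, here is a minimal plain-Python sketch. It does not use Spark and is not taken from the course. The function names, the integer timestamps, and the per-event watermark update are all assumptions made for illustration; Spark itself advances the watermark per micro-batch rather than per event.

```python
from collections import defaultdict


def tumbling_window(event_time, size):
    """Return the single [start, end) tumbling window containing event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every [start, end) sliding window containing event_time."""
    windows = []
    start = event_time - (event_time % slide)  # latest window start <= event_time
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def count_with_watermark(events, size, delay):
    """Count events per tumbling window, in arrival order.

    Simplified model (an assumption, not Spark's exact behaviour):
    watermark = max event time accepted so far - delay, and an event is
    dropped when its window has already ended at or before the watermark.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = float("-inf")
    watermark = float("-inf")
    for t in events:
        window = tumbling_window(t, size)
        if window[1] <= watermark:
            dropped.append(t)  # too late: state for this window is gone
            continue
        counts[window] += 1
        max_seen = max(max_seen, t)
        watermark = max_seen - delay
    return dict(counts), dropped


# Hypothetical event times (in seconds), listed in arrival order.
counts, dropped = count_with_watermark([1, 4, 12, 3, 27, 8, 21], size=10, delay=5)
print(counts)   # {(0, 10): 3, (10, 20): 1, (20, 30): 2}
print(dropped)  # [8]
print(sliding_windows(12, size=10, slide=5))  # [(5, 15), (10, 20)]
```

The late event at time 3 is still counted because its window had not yet fallen behind the watermark, while the event at time 8 arrives after the watermark has passed the end of its window and is discarded. This is the mechanism that lets a streaming engine bound how much window state it keeps.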
Syllabus
- Course Overview (2 mins)
- Performing Windowing Operations on Data (39 mins)
- Exploring Aggregations Using Watermarks (52 mins)
- Performing Join Operations on Data (29 mins)
Taught by
Janani Ravi
Related Courses
- CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
- Big Data Analytics (University of Adelaide via edX)
- Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
- Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
- Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)