Building Robust Streaming Data Pipelines with Apache Spark
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore the challenges and solutions for building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel to create a continuous data pipeline for Spark applications, addressing issues like dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from various systems into Apache Kafka, and leverage Spark's built-in Kafka connector. Gain insights into running these technologies inside Docker and benefit from lessons learned in real-world implementations. The talk covers data preparation, various data types and formats, and includes demonstrations comparing Hive and Spark, as well as practical examples using HDFS and Python code.
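As a rough illustration of the Kafka-to-Spark leg of such a pipeline, the sketch below reads a Kafka topic with Spark Structured Streaming's built-in Kafka source from Python. The broker address, topic name, and console sink are illustrative assumptions rather than details from the talk, and the spark-sql-kafka connector package must be on the Spark classpath (for example via spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:&lt;spark-version&gt;).

    # Minimal sketch: consume a Kafka topic with Spark Structured Streaming.
    # Broker, topic, and sink are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("kafka-streaming-sketch")
             .getOrCreate())

    # Subscribe to the "events" topic (hypothetical name) on a local broker.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "events")
           .load())

    # Kafka delivers key and value as binary; cast them to strings
    # before any further cleaning or transformation.
    events = raw.select(col("key").cast("string"),
                        col("value").cast("string"))

    # Write the stream to the console so the example stays self-contained;
    # a real pipeline would target HDFS, a database, or another Kafka topic.
    query = (events.writeStream
             .format("console")
             .outputMode("append")
             .start())

    query.awaitTermination()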
Syllabus
Introduction
Data Preparation
Data Types
Camel
Data formats
Demo
Hive vs Spark
Demo Time
Demo Starts
Logs
HDFS
Python
Code
Recap
Office Hours
Taught by
Linux Foundation
Related Courses
Building Batch Data Pipelines on GCP auf Deutsch - Google Cloud via Coursera
Building Batch Data Pipelines on GCP en Français - Google Cloud via Coursera
Mastering Azure Data Factory: From Basics to Advanced Level - Udemy
Data Science de A a Z - Extração e Exibição dos Dados - Udemy
Building Batch Data Processing Solutions in Microsoft Azure - Pluralsight