Building Robust Streaming Data Pipelines with Apache Spark
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore the challenges and solutions for building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel to create a continuous data pipeline for Spark applications, addressing issues such as dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from various systems into Apache Kafka, and see how Spark's built-in Kafka connector consumes that data downstream. Gain insights into running these technologies inside Docker and benefit from lessons learned in real-world implementations. The talk covers data preparation, various data types and formats, and includes demonstrations comparing Hive and Spark, as well as practical examples using HDFS and Python code.
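The pattern the talk describes, Spark consuming a Kafka topic through the built-in connector and transforming records as they arrive, can be sketched in a few lines of PySpark. This is a minimal illustration, not code from the talk: the broker address localhost:9092, the topic name events, and the console sink are placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Start a Spark session; the Kafka connector package must be on the classpath,
# e.g. via --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version>
spark = (SparkSession.builder
         .appName("kafka-streaming-sketch")
         .getOrCreate())

# Read a continuous stream from Kafka using Spark's built-in connector.
# The broker address and topic name below are placeholders.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers keys and values as binary; cast to strings before transforming.
records = stream.select(
    col("key").cast("string"),
    col("value").cast("string"),
)

# Write the transformed stream out (to the console here, for demonstration).
query = (records.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()
```

The select stage is also where a real ETL job would drop or repair dirty records before they reach downstream consumers.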
Syllabus
Introduction
Data Preparation
Data Types
Camel
Data formats
Demo
Hive vs Spark
Demo Time
Demo Starts
Logs
HDFS
Python
Code
Recap
Office Hours
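The demo sections above (Logs, HDFS, Python, Code) follow a common shape: read raw logs out of HDFS and clean them with Spark before analysis. Below is a minimal sketch of that style of job; the namenode address, HDFS path, and log-line format are assumptions for illustration, not the demo's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("hdfs-log-etl-sketch").getOrCreate()

# Load raw log lines from HDFS; the namenode address and path are placeholders.
logs = spark.read.text("hdfs://namenode:8020/data/app-logs/*.log")

# Parse a simple "LEVEL message" layout with a regex; this pattern is an
# assumption about the log format, not the one used in the talk.
pattern = r"^(INFO|WARN|ERROR)\s+(.*)$"
parsed = logs.select(
    regexp_extract(col("value"), pattern, 1).alias("level"),
    regexp_extract(col("value"), pattern, 2).alias("message"),
)

# Drop "dirty" rows that did not match the expected format.
clean = parsed.filter(col("level") != "")

clean.groupBy("level").count().show()
```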
Taught by
Linux Foundation
Related Courses
Google Cloud Big Data and Machine Learning Fundamentals en Español (Google Cloud via Coursera)
Big Data Emerging Technologies (Yonsei University via Coursera)
Building Resilient Streaming Systems on GCP em Português Brasileiro (Google Cloud via Coursera)
Building Resilient Streaming Systems on Google Cloud Platform en Español (Google Cloud via Coursera)
AWS Certified Data Analytics Specialty 2024 - Hands On! (Udemy)