Building Robust Streaming Data Pipelines with Apache Spark
Offered By: Linux Foundation via YouTube
Course Description
Overview
Explore the challenges and solutions for building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel to create a continuous data pipeline for Spark applications, addressing issues like dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from various systems into Apache Kafka, and leverage Spark's built-in Kafka connector. Gain insights into running these technologies inside Docker and benefit from lessons learned in real-world implementations. The talk covers data preparation, various data types and formats, and includes demonstrations comparing Hive and Spark, as well as practical examples using HDFS and Python code.
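To make the connector usage concrete, here is a minimal sketch of consuming a Kafka topic with Spark Structured Streaming's built-in Kafka source in Python. The broker address (localhost:9092), topic name (events), and console sink are illustrative assumptions, not details taken from the talk.

```python
# Minimal sketch: Spark's built-in Kafka connector via Structured Streaming.
# Requires the Kafka integration package at submit time, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version> app.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("streaming-pipeline-demo")  # hypothetical app name
    .getOrCreate()
)

# Read a continuous stream from Kafka; broker and topic are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before transforming.
parsed = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
)

# Write the transformed stream out; a console sink stands in for HDFS
# or another downstream store here.
query = (
    parsed.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

In a pipeline like the one described, an Apache Camel route would typically feed the Kafka topic from upstream systems, with Spark consuming and transforming the stream as above.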
Syllabus
Introduction
Data Preparation
Data Types
Camel
Data formats
Demo
Hive vs Spark
Demo Time
Demo Starts
Logs
HDFS
Python
Code
Recap
Office Hours
Taught by
Linux Foundation
Related Courses
A Beginner's Guide to Docker (Packt via FutureLearn)
A Beginner's Guide to Kubernetes for Container Orchestration (Packt via FutureLearn)
Beginner's Guide to Containers and Orchestration (A Cloud Guru)
Designing High Availability, Fault Tolerance, and DR with AWS Services (A Cloud Guru)
Docker Certified Associate (DCA) (A Cloud Guru)