Building Robust Streaming Data Pipelines with Apache Spark

Offered By: Linux Foundation via YouTube

Tags

Apache Spark, Python, Docker, Apache Kafka, Apache Camel, HDFS, Data Streaming, Data Pipelines, ETL

Course Description

Overview

Explore the challenges and solutions for building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel to create a continuous data pipeline for Spark applications, addressing issues like dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from various systems into Apache Kafka, and leverage Spark's built-in Kafka connector. Gain insights into running these technologies inside Docker and benefit from lessons learned in real-world implementations. The talk covers data preparation, various data types and formats, and includes demonstrations comparing Hive and Spark, as well as practical examples using HDFS and Python code.
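To make the ETL concern concrete: the "dirty data" problem described above usually comes down to validating and normalizing records before they are loaded into Kafka for Spark to consume. Below is a minimal sketch of such a transform step in plain Python; the `clean_record` and `transform` names and the dict-shaped record schema are illustrative assumptions, not taken from the talk.

```python
# Hypothetical transform step for an ETL pipeline feeding Kafka.
# Records that cannot be normalized are dropped rather than passed downstream.

def clean_record(raw):
    """Return a normalized record dict, or None if the record is too dirty to keep."""
    if not isinstance(raw, dict):
        return None
    # Require a non-empty id.
    record_id = str(raw.get("id", "")).strip()
    if not record_id:
        return None
    # Require a parseable numeric value.
    try:
        value = float(raw["value"])
    except (KeyError, TypeError, ValueError):
        return None
    return {"id": record_id, "value": value}

def transform(records):
    """Keep only records that pass validation; survivors would be published to a Kafka topic."""
    return [r for r in (clean_record(x) for x in records) if r is not None]
```

In the pipeline the talk describes, a step like this would sit between extraction (e.g. via Apache Camel routes) and the load into Kafka; Spark then consumes the cleaned topic through its built-in Kafka connector (the `kafka` source format in Spark SQL / Structured Streaming).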

Syllabus

Introduction
Data Preparation
Data Types
Camel
Data formats
Demo
Hive vs Spark
Demo Time
Demo Starts
Logs
HDFS
Python
Code
Recap
Office Hours


Taught by

Linux Foundation

Related Courses

A Beginner’s Guide to Docker
Packt via FutureLearn
A Beginner's Guide to Kubernetes for Container Orchestration
Packt via FutureLearn
Beginner’s Guide to Containers and Orchestration
A Cloud Guru
Designing High Availability, Fault Tolerance, and DR with AWS Services
A Cloud Guru
Docker Certified Associate (DCA)
A Cloud Guru