Migrating Airflow-Based Apache Spark Jobs to Kubernetes - The Native Way

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Kubernetes Courses, Data Engineering Courses, Cloud Migration Courses, Containerization Courses

Course Description

Overview

Learn how to migrate Apache Spark workloads from Amazon EMR to Kubernetes in this 21-minute conference talk offered by Databricks. The talk covers the challenges of an existing Spark infrastructure, the motivation for moving to Kubernetes, and what it takes to run Spark natively on Kubernetes, including monitoring and logging. It also shares best practices for using Airflow as the orchestrator. Follow Nielsen Identity, which processes massive amounts of data with Apache Spark, as the team combines the GCP Spark-on-K8s operator with a native Airflow integration to achieve its goals. Topics include Kubernetes auto-scaling, an overview of Spark-on-Kubernetes, and making the migration production-ready. The talk is aimed at data engineers and architects who want to optimize their Spark workloads and reduce operational costs.

Syllabus

Introduction
What will you learn?
Nielsen Identity in numbers
Common data pipeline pattern - Airflow DAG
Spark clusters
What is EMR?
EMR pricing - example
Running Airflow-based Spark jobs on EMR
Basic Kubernetes terminology
Kubernetes auto-scale
Spark-On-Kubernetes overview
Spark-submit example - SparkPi
Spark-On-Kubernetes operator example - SparkPi
Airflow Spark Kubernetes integration
Common data pipeline pattern - revised
Connecting the dots... making it production-ready
Visibility
Robustness
Airflow integration current status


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
Big Data Analytics (University of Adelaide via edX)
Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)