Faster Data Integration Pipeline Execution Using Spark-Jobserver

Offered By: Databricks via YouTube

Tags

Apache Spark, Big Data, Data Visualization, REST APIs, Data Integration, Kerberos, ETL

Course Description

Overview

Explore a 32-minute conference talk from Databricks on leveraging Spark-Jobserver to enhance data integration pipeline execution. Learn how Informatica utilizes Spark-Jobserver's capabilities to solve data visualization challenges for hierarchical data in Big Data pipelines. Discover the benefits of Spark context reuse for faster task execution, integration techniques using REST APIs, and strategies for managing parallel job execution and monitoring. Gain insights into configuring Spark-Jobserver with YARN cluster mode, handling secure SSL-enabled clusters, and managing multiple Spark-Jobserver instances. Delve into topics such as concurrent job execution, dependency resolution, and the journey of adopting Spark-Jobserver in a data integration product.
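As a rough illustration of the REST-based integration and context reuse mentioned above, the sketch below builds the request URLs a client might send to a Spark-Jobserver instance: one to create a long-lived context, one to run a job inside that context, and one to poll the job. It is a minimal sketch, not the approach shown in the talk. The base URL, context name, application name, and class path are hypothetical placeholders; the endpoint paths and query parameters follow the open-source spark-jobserver REST API as commonly documented and should be checked against the version in use.

```python
from urllib.parse import quote, urlencode

# Assumption: a jobserver reachable on localhost at port 8090.
BASE_URL = "http://localhost:8090"


def context_url(name, params=None):
    """URL for POST /contexts/<name>, which creates a reusable Spark context."""
    query = "?" + urlencode(params) if params else ""
    return f"{BASE_URL}/contexts/{quote(name)}{query}"


def job_url(app_name, class_path, context, sync=False):
    """URL for POST /jobs, which runs a job inside an already-created context."""
    params = {
        "appName": app_name,        # name the application binary was uploaded under
        "classPath": class_path,    # fully qualified job class
        "context": context,         # reusing a context skips Spark start-up cost
        "sync": str(sync).lower(),  # "true" waits for the result in the response
    }
    return f"{BASE_URL}/jobs?{urlencode(params)}"


def job_status_url(job_id):
    """URL for GET /jobs/<jobId>, used to poll an asynchronous job."""
    return f"{BASE_URL}/jobs/{quote(job_id)}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(context_url("preview-ctx", {"num-cpu-cores": 2, "memory-per-node": "512m"}))
    print(job_url("preview-app", "com.example.PreviewJob", "preview-ctx", sync=True))
    print(job_status_url("some-job-id"))
```

Creating the context once and submitting every later job with the same `context` value is what avoids paying the Spark start-up cost on each request, which is the benefit the overview attributes to context reuse.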

Syllabus

Intro
Informatica ETL Pipeline
Dealing with buggy pipelines
Data Preview - Feature Requirements
What Did Spark-submit Based Data Preview Achieve?
Execution Profiling Results - Spark-submit
Compare Spark-submit with Spark Job Server
Spark-submit based Architecture
SJS based Architecture
Execution Flow
Spark Job Server vs Spark-submit
Setup Details
Getting started
Environment Variables (local.sh.template)
Application Code Migration
WordCount Example
Running Jobs
Handling Job Dependencies
Multiple Spark Job Servers
Concurrency
Support for Kerberos
HTTPS/SSL Enabled Server
Logging
Key Takeaways
Timeouts (in local.conf.template)
Complex Data Representation in Informatica Developer Tool
Monitoring: Binaries
Monitoring: Spark Context
Monitoring: Jobs
Monitoring: Yarn Job


Taught by

Databricks
