
Accelerating Apache Spark Shuffle for Data Analytics on Cloud with Remote Persistent Memory Pools

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Big Data Courses, Cloud Computing Courses, Distributed Systems Courses, Data Analytics Courses, Scalability Courses, Persistent Memory Courses

Course Description

Overview

This 33-minute conference talk, presented by Databricks, covers accelerating Apache Spark shuffle operations for cloud-based data analytics using remote persistent memory pools. It opens with the challenges of serving growing data-driven AI and analytics workloads when storage and compute are disaggregated. It then proposes a fully disaggregated shuffle solution built on persistent memory and RDMA, consisting of a new pluggable shuffle manager and a distributed storage system. The talk walks through the architecture, optimization features, and workflow of the solution, and explains how it improves Spark's scalability, performance, and reliability. Experimental results show up to 10x speedup over traditional shuffle solutions.

Syllabus

Introduction
Agenda
Motivation
Recap
Original Example
Results
New Challenges
RPMP Architecture
Optimization Features
Workflow
Summary
Performance Evaluation
Examples
Call to Action
Optima Natives


Taught by

Databricks

Related Courses

CS115x: Advanced Apache Spark for Data Science and Data Engineering (University of California, Berkeley via edX)
Big Data Analytics (University of Adelaide via edX)
Big Data Essentials: HDFS, MapReduce and Spark RDD (Yandex via Coursera)
Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames (Yandex via Coursera)
Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)