Migrating Pinterest Apache Spark Clusters from HDFS to S3
Offered By: Databricks via YouTube
Course Description
Overview
Explore how Pinterest migrated its critical Apache Spark clusters from HDFS to S3 in this 30-minute presentation. Dive into the motivations behind the transition and the accompanying shift from Mesos to YARN as the resource scheduler. Learn about the technical challenges faced, such as S3 performance, consistency, and access control, and how they were addressed to match HDFS capabilities. Discover the changes made to the job submission process to accommodate differences between Mesos and YARN. Gain insights into Spark performance optimization through profiling and EC2 instance type selection. Examine the performance results and how Pinterest kept the migration smooth. Understand key takeaways, including read-after-write consistency solutions, performance comparisons between S3 and HDFS, strategies for dealing with metadata operations, and improvements to S3Committer. Explore the benefits of S3 over HDFS, the cost savings, and the current state of Spark at Pinterest.
Syllabus
Intro
Agenda
Big Data Platform
Old vs New cluster
Old Cluster: Performance Bottleneck
A Simple Aggregation Query
9k Mappers * 9k Reducers
New Cluster: Choose the right EC2 instance
Key Takeaways
Read after write consistency
How often does this happen?
Solution Considerations
Our Approach
Performance Comparison: S3 vs HDFS
Dealing with Metadata Operations
Reduce Move Operations
Multipart Upload API
The Last Move Operation
Fix Bucket Rate Limit Issue (503)
Improving S3Committer
S3 Benefits Compared to HDFS
Things We Miss in Mesos
Cost Saving
Spark at Pinterest
Taught by
Databricks
Related Courses
iOS Persistence and Core Data (Udacity)
Data Migration to SAP S/4HANA (SAP Learning)
Deep Dive into Amazon Glacier (Amazon via Independent)
Upgrade2Success – Making SAP ERP HCM Migration Easier (SAP Learning)
Migrating Your Business Data to SAP S/4HANA – New Implementation Scenario (SAP Learning)