Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Offered By: Databricks via YouTube
Course Description
Overview
Explore a 21-minute conference talk on scaling machine learning feature engineering in Apache Spark at Facebook. Dive into the implementation of the Feature Injection and Feature Reaping techniques, including Spark core/SQL enhancements, indexed/aligned tables, and the new ORC FlatMap encoding. Learn about Catalyst optimizations, new ORC physical encodings for feature maps, and the process of writing and committing indexed feature tables. Gain insight into Facebook's approach to improving prediction model quality through efficient data management and processing in Spark.
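For readers unfamiliar with why a flattened map encoding matters, the following is a conceptual sketch only, not the ORC file format or the talk's implementation. It assumes the flattened encoding keeps each feature key's values in its own column-like stream, so a reader that wants one feature does not have to scan every key of every row. The sample rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: each row carries a map of feature id -> value.
rows = [
    {"f1": 0.5, "f2": 1.0},
    {"f2": 3.0},
    {"f1": 0.1, "f3": 7.0},
]


def encode_regular_map(rows):
    """Map-style layout: per-row lengths plus one shared keys stream and values stream."""
    lengths, keys, values = [], [], []
    for row in rows:
        lengths.append(len(row))
        for key, value in row.items():
            keys.append(key)
            values.append(value)
    return lengths, keys, values


def encode_flattened_map(rows):
    """Flattened layout (assumed): one column per key, None where a row lacks the key."""
    columns = defaultdict(lambda: [None] * len(rows))
    for i, row in enumerate(rows):
        for key, value in row.items():
            columns[key][i] = value
    return dict(columns)


def read_feature_regular(encoded, wanted):
    """Reading one feature from the map layout walks every key/value entry."""
    lengths, keys, values = encoded
    out, pos = [], 0
    for n in lengths:
        found = None
        for j in range(pos, pos + n):
            if keys[j] == wanted:
                found = values[j]
        out.append(found)
        pos += n
    return out


def read_feature_flattened(encoded, wanted, num_rows):
    """Reading one feature from the flattened layout touches only that key's column."""
    return encoded.get(wanted, [None] * num_rows)


regular = encode_regular_map(rows)
flattened = encode_flattened_map(rows)
print(read_feature_regular(regular, "f1"))
print(read_feature_flattened(flattened, "f1", len(rows)))
```

Both reads return the same per-row values for the requested feature; the difference is how much data each one has to scan, which is the motivation usually given for per-key layouts when models consume only a subset of features.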
Syllabus
Intro
Machine Learning at Facebook
Data Layouts (Tables and Physical Encodings)
Background: Apache ORC
How is a Feature Map Stored in ORC?
Introducing: ORC Flattened Map
Feature Reaping
Introducing: Aligned Table
Query Plan for Aligned Table
Reading Aligned Tables
End to End Performance
Summary
Future Work
Taught by
Databricks
Related Courses
Data Science at Scale - Capstone Project
University of Washington via Coursera
Feature Engineering for Improving Learning Environments
University of Texas Arlington via edX
How to Win a Data Science Competition: Learn from Top Kagglers
Higher School of Economics via Coursera
Advanced Machine Learning
The Open University via FutureLearn
Feature Engineering
Google Cloud via Coursera