Big Data Analytics with Hadoop and Apache Spark
Offered By: LinkedIn Learning
Course Description
Overview
Discover how to build scalable, optimized data analytics pipelines by combining the strengths of Apache Hadoop and Apache Spark.
Syllabus
Introduction
- The combined power of Spark and Hadoop Distributed File System (HDFS)
- Apache Hadoop overview
- Apache Spark overview
- Integrating Hadoop and Spark
- Setting up the environment
- Using exercise files
- Storage formats
- Compression
- Partitioning
- Bucketing
- Best practices for data storage
- Reading external files into Spark
- Writing to HDFS
- Parallel writes with partitioning
- Parallel writes with bucketing
- Best practices for ingestion
- How Spark works
- Reading HDFS files with schema
- Reading partitioned data
- Reading bucketed data
- Best practices for data extraction
- Pushing down projections
- Pushing down filters
- Managing partitions
- Managing shuffling
- Improving joins
- Storing intermediate results
- Best practices for data processing
- Problem definition
- Data loading
- Total score analytics
- Average score analytics
- Top student analytics
- Next steps
Taught by
Kumaran Ponnambalam
Related Courses
- Big Data Analytics in Healthcare (Georgia Institute of Technology via Udacity)
- Mining Massive Datasets (Stanford University via edX)
- The Caltech-JPL Summer School on Big Data Analytics (California Institute of Technology via Coursera)
- Big Data Analytics for Healthcare (Georgia Institute of Technology via Coursera)
- Data Lakes for Big Data (EdCast)