Big Data Analytics with Hadoop and Apache Spark
Offered By: LinkedIn Learning
Course Description
Overview
Discover how to build scalable, optimized data analytics pipelines by combining Apache Hadoop and Apache Spark.
Syllabus
Introduction
- The combined power of Spark and Hadoop Distributed File System (HDFS)
- Apache Hadoop overview
- Apache Spark overview
- Integrating Hadoop and Spark
- Setting up the environment
- Using exercise files
- Storage formats
- Compression
- Partitioning
- Bucketing
- Best practices for data storage
- Reading external files into Spark
- Writing to HDFS
- Parallel writes with partitioning
- Parallel writes with bucketing
- Best practices for ingestion
- How Spark works
- Reading HDFS files with schema
- Reading partitioned data
- Reading bucketed data
- Best practices for data extraction
- Pushing down projections
- Pushing down filters
- Managing partitions
- Managing shuffling
- Improving joins
- Storing intermediate results
- Best practices for data processing
- Problem definition
- Data loading
- Total score analytics
- Average score analytics
- Top student analytics
- Next steps
Taught by
Kumaran Ponnambalam
Related Courses
- Cloud Computing Concepts, Part 1 (University of Illinois at Urbana-Champaign via Coursera)
- Cloud Computing Concepts: Part 2 (University of Illinois at Urbana-Champaign via Coursera)
- Reliable Distributed Algorithms - Part 1 (KTH Royal Institute of Technology via edX)
- Introduction to Apache Spark and AWS (University of London International Programmes via Coursera)
- Réalisez des calculs distribués sur des données massives (CentraleSupélec via OpenClassrooms)