The Ultimate Hands-On Hadoop: Tame your Big Data!
Offered By: Skillshare
Course Description
Overview
Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We'll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.
- Install and work with a real Hadoop installation right on your desktop with Hortonworks and the Ambari UI
- Manage big data on a cluster with HDFS and MapReduce
- Write programs to analyze data on Hadoop with Pig and Spark
- Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
- Design real-world systems using the Hadoop ecosystem
- Learn how your cluster is managed with YARN, Mesos, ZooKeeper, Oozie, Zeppelin, and Hue
- Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm
Syllabus
- Introduction
- Install Hadoop on your Desktop
- Hadoop Overview and History
- Overview of the Hadoop Ecosystem
- HDFS: What it is, and how it works
- [Activity] Install the MovieLens dataset into HDFS using the Ambari UI
- [Activity] Install the MovieLens dataset into HDFS using the command line
- MapReduce: What it is, and how it works
- How MapReduce distributes processing
- MapReduce example: Break down movie ratings by rating score
- [Activity] Installing Python, MRJob, and nano
- [Activity] Code up the ratings histogram MapReduce job and run it
- [Exercise] Rank movies by their popularity
- [Activity] Check your results against mine!
- Introducing Ambari
- Introducing Pig
- Example: Find the oldest movie with a 5-star rating using Pig
- [Activity] Find old 5-star movies with Pig
- More Pig Latin
- [Exercise] Find the most-rated one-star movie
- Pig Challenge: Compare Your Results to Mine!
- Why Spark?
- The Resilient Distributed Dataset (RDD)
- [Activity] Find the movie with the lowest average rating - with RDDs
- Datasets and Spark 2.0
- [Activity] Find the movie with the lowest average rating - with DataFrames
- [Activity] Movie recommendations with MLlib
- [Exercise] Filter the lowest-rated movies by number of ratings
- [Activity] Check your results against mine!
- What is Hive?
- [Activity] Use Hive to find the most popular movie
- How Hive works
- [Exercise] Use Hive to find the movie with the highest average rating
- Compare your solution to mine.
- Integrating MySQL with Hadoop
- [Activity] Install MySQL and import our movie data
- [Activity] Use Sqoop to import data from MySQL to HDFS/Hive
- [Activity] Use Sqoop to export data from Hadoop to MySQL
- Why NoSQL?
- What is HBase?
- [Activity] Import movie ratings into HBase
- [Activity] Use HBase with Pig to import data at scale.
- Cassandra overview
- [Activity] Installing Cassandra
- [Activity] Write Spark output into Cassandra
- MongoDB Overview
- [Activity] Install MongoDB, and integrate Spark with MongoDB
- [Activity] Using the MongoDB shell
- Choosing a database technology
- [Exercise] Choose a database for a given problem
- Overview of Drill
- [Activity] Setting Up Drill
- [Activity] Querying across multiple databases with Drill
- Overview of Phoenix
- [Activity] Install Phoenix and query HBase with it
- [Activity] Integrate Phoenix with Pig
- Overview of Presto
- [Activity] Install Presto, and query Hive with it.
- [Activity] Query both Cassandra and Hive using Presto.
- YARN explained
- Tez explained
- [Activity] Use Hive on Tez and measure the performance benefit
- Mesos explained
- ZooKeeper explained
- [Activity] Simulating a failing master with ZooKeeper
- Oozie explained
- [Activity] Set up a simple Oozie workflow
- Zeppelin overview
- [Activity] Use Zeppelin to analyze movie ratings, part 1
- [Activity] Use Zeppelin to analyze movie ratings, part 2
- Hue overview
- Other technologies worth mentioning
- Kafka explained
- [Activity] Setting up Kafka, and publishing some data.
- [Activity] Publishing web logs with Kafka
- Flume explained
- [Activity] Set Up Flume and publish logs with Spark
- [Activity] Set up Flume to monitor a directory and store its data in HDFS
- Spark Streaming: Introduction
- [Activity] Analyze web logs published with Flume using Spark Streaming
- [Exercise] Monitor Flume-published logs for errors in real time
- Exercise solution: Aggregating HTTP access codes with Spark Streaming
- Apache Storm: Introduction
- [Activity] Count words with Storm
- Flink: An Overview
- [Activity] Counting words with Flink
- The Best of the Rest
- Review: How the pieces fit together
- Understanding your requirements
- Sample application: consume webserver logs and keep track of top-sellers
- Sample application: serving movie recommendations to a website
- [Exercise] Design a system to report web sessions per day
- Exercise solution: Design a system to count daily sessions
Taught by
Frank Kane
Related Courses
- Big Data Computing with Spark, The Hong Kong University of Science and Technology via edX
- Advanced Big Data Systems | 高级大数据系统, Tsinghua University via edX
- Apache Spark Essential Training, LinkedIn Learning
- 数据科学 | Data Science, Tsinghua University via edX
- Data Streaming, Udacity