Spark, Hadoop, and Snowflake for Data Engineering
Offered By: Pragmatic AI Labs via edX
Course Description
Overview
In this course, you will:
- Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them
- Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks
- Hone your Python data science skills with PySpark
- Discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks
- Learn methodologies that improve your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices
This course is designed for learners who want to pursue or advance their career in data science or data engineering, or for software developers or engineers who want to grow their data management skill set. With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.
Syllabus
Module 1: Overview and Introduction to PySpark (7 hours)
- 10 videos (Total 25 minutes)
- Meet your Co-Instructor: Kennedy Behrman (0 minutes, Preview module)
- Meet your Co-Instructor: Noah Gift (1 minute)
- Overview of Big Data Platforms (1 minute)
- Getting Started with Hadoop (1 minute)
- Getting Started with Spark (1 minute)
- Introduction to Resilient Distributed Datasets (RDD) (2 minutes)
- Resilient Distributed Datasets (RDD) Demo (4 minutes)
- Introduction to Spark SQL (1 minute)
- PySpark Dataframe Demo: Part 1 (3 minutes)
- PySpark Dataframe Demo: Part 2 (7 minutes)
- 9 readings (Total 90 minutes)
- Welcome to Data Engineering Platforms with Python! (10 minutes)
- What is Apache Hadoop? (10 minutes)
- What is Apache Spark? (10 minutes)
- Use Apache Spark in Azure Databricks (optional) (10 minutes)
- Choosing between Hadoop and Spark (10 minutes)
- What are RDDs? (10 minutes)
- Getting Started: Creating RDDs with PySpark (10 minutes)
- Spark SQL, Dataframes and Datasets (10 minutes)
- PySpark and Spark SQL (10 minutes)
- 7 quizzes (Total 210 minutes)
- PySpark (30 minutes)
- Big Data Platforms (30 minutes)
- Apache Hadoop Concepts (30 minutes)
- Apache Spark Concepts (30 minutes)
- RDD Concepts (30 minutes)
- Spark SQL Concepts (30 minutes)
- PySpark Dataframe Concepts (30 minutes)
- 2 discussion prompts (Total 20 minutes)
- Meet and Greet (optional) (10 minutes)
- Let Us Know if Something's Not Working (10 minutes)
- 2 ungraded labs (Total 120 minutes)
- Practice: Creating RDDs with PySpark (60 minutes)
- Practice: Reading Data into Dataframes (60 minutes)
Module 2: Snowflake (4 hours)
- 8 videos (Total 27 minutes)
- What is Snowflake? (2 minutes, Preview module)
- Snowflake Layers (2 minutes)
- Snowflake Web UI (3 minutes)
- Navigating Snowflake (3 minutes)
- Creating a Table in Snowflake (5 minutes)
- Snowflake Warehouses (3 minutes)
- Writing to Snowflake (3 minutes)
- Reading from Snowflake (2 minutes)
- 5 readings (Total 50 minutes)
- Accessing Snowflake (10 minutes)
- Detailed View Inside Snowflake (10 minutes)
- Snowsight: The Snowflake Web Interface (10 minutes)
- Working with Warehouses (10 minutes)
- Python Connector Documentation (10 minutes)
- 6 quizzes (Total 180 minutes)
- Snowflake (30 minutes)
- Snowflake Architecture (30 minutes)
- Snowflake Layers (30 minutes)
- Navigating Snowflake (30 minutes)
- Creating a Table (30 minutes)
- Writing to Snowflake (30 minutes)
Module 3: Azure Databricks and MLflow (5 hours)
- 16 videos (Total 71 minutes)
- Accessing Databricks (0 minutes, Preview module)
- Spark Notebooks with Databricks (4 minutes)
- Using Data with Databricks (4 minutes)
- Working with Workspaces in Databricks (3 minutes)
- Advanced Capabilities of Databricks (1 minute)
- PySpark Introduction on Databricks (7 minutes)
- Exploring Databricks Azure Features (3 minutes)
- Using the DBFS to AutoML Workflow (4 minutes)
- Load, Register and Deploy ML Models (2 minutes)
- Databricks Model Registry (2 minutes)
- Model Serving on Databricks (2 minutes)
- What is MLOps? (12 minutes)
- Exploring Open-Source MLflow Frameworks (5 minutes)
- Running MLflow with Databricks (6 minutes)
- End to End Databricks MLflow (4 minutes)
- Databricks Autologging with MLflow (4 minutes)
- 7 readings (Total 70 minutes)
- What is Azure Databricks? (10 minutes)
- Introduction to Databricks Machine Learning (10 minutes)
- What is the Databricks File System (DBFS)? (10 minutes)
- Serverless Compute with Databricks (10 minutes)
- MLOps Workflow on Azure Databricks (10 minutes)
- Run MLflow Projects on Azure Databricks (10 minutes)
- Databricks Autologging (10 minutes)
- 4 quizzes (Total 120 minutes)
- Databricks (30 minutes)
- PySpark SQL (30 minutes)
- PySpark DataFrames (30 minutes)
- MLflow with Databricks (30 minutes)
- 1 ungraded lab (Total 60 minutes)
- ETL-Part-1: Keyword Extractor Tool to HashTag Tool (60 minutes)
Module 4: DataOps and Operations Methodologies (12 hours)
- 21 videos (Total 502 minutes)
- Kaizen Methodology for Data (4 minutes, Preview module)
- Introducing GitHub Codespaces (9 minutes)
- Compiling Python in GitHub Codespaces (18 minutes)
- Walking through SageMaker Studio Lab (28 minutes)
- Pytest Master Class (Optional) (166 minutes)
- What is DevOps? (2 minutes)
- DevOps Key Concepts (35 minutes)
- Continuous Integration Overview (32 minutes)
- Build an NLP in Cloud9 with Python (43 minutes)
- Build a Continuously Deployed Containerized FastAPI Microservice (43 minutes)
- Hugo Continuous Deploy on AWS (18 minutes)
- Container Based Continuous Delivery (8 minutes)
- What is DataOps? (1 minute)
- DataOps and MLOps with Snowflake (61 minutes)
- Building Cloud Pipelines with Step Functions and Lambda (16 minutes)
- What is a Data Lake? (2 minutes)
- Data Warehouse vs. Feature Store (2 minutes)
- Big Data Challenges (1 minute)
- Types of Big Data Processing (1 minute)
- Real-World Data Engineering Pipeline (2 minutes)
- Data Feedback Loop (0 minutes)
- 6 readings (Total 60 minutes)
- GitHub Codespaces Overview (10 minutes)
- Getting Started with Amazon SageMaker Studio Lab (10 minutes)
- Teaching MLOps at Scale with GitHub (Optional) (10 minutes)
- Getting Started with DevOps and Cloud Computing (10 minutes)
- Benefits of Serverless ETL Technologies (10 minutes)
- Next Steps (10 minutes)
- 4 quizzes (Total 120 minutes)
- DataOps and Operations Methodologies (30 minutes)
- Kaizen Methodology (30 minutes)
- DevOps (30 minutes)
- DataOps (30 minutes)
- 1 ungraded lab (Total 60 minutes)
- ETL-Part2: SQLite ETL Destination (60 minutes)
Taught by
Noah Gift and Kennedy Behrman
Related Courses
- AZ-400: Designing and Implementing Microsoft DevOps Solutions (A Cloud Guru)
- Building a Continuous Integration Pipeline with Travis CI (A Cloud Guru)
- Certified Jenkins Engineer (A Cloud Guru)
- CloudFormation Deep Dive (A Cloud Guru)
- DevOps Concepts (A Cloud Guru)