Data Engineering
Offered By: IBM via edX
Course Description
Overview
Organizations have more data at their disposal today than ever before. The vast amount of data that organizations are capturing, along with their desire to extract meaningful insights from it, is driving an urgent demand for Data Engineers.
Data Engineers play a fundamental role in harnessing the data that enables organizations to apply business intelligence and make informed decisions. Today’s Data Engineers require a broad set of skills to develop and optimize data systems and make data available to the organization for analysis.
This Professional Certificate provides you with the job-ready skills you need to launch your career as an entry-level data engineer.
Upon completing this Professional Certificate, you will have extensive knowledge of, and practical experience with: cloud-based relational databases (RDBMS) and NoSQL data repositories; Python, Bash, and SQL; processing big data with Apache Hadoop and Apache Spark; using ETL (extract, transform, and load) tools; creating data pipelines with Apache Kafka and Apache Airflow; designing, populating, and querying data warehouses; and utilizing business intelligence tools.
Within each course, you’ll gain practical experience with hands-on labs and projects for building your portfolio. In the final Capstone project, you’ll apply your knowledge and skills attained throughout this program and demonstrate your ability to perform as a Data Engineer.
This program does not require any prior data engineering or programming experience.
Syllabus
Course 1: Data Engineering Basics for Everyone
Learn about data engineering concepts, the data engineering ecosystem, and its lifecycle. Also learn about the systems, processes, and tools you need as a Data Engineer to gather, transform, load, process, query, and manage data so that it can be leveraged by data consumers for operations and decision-making.
Course 2: Python Basics for Data Science
This Python course provides a beginner-friendly introduction to Python for Data Science. Practice through lab exercises, and you'll be ready to create your first Python scripts on your own!
Course 3: Python for Data Engineering Project
An opportunity to apply your foundational Python skills in a project, using various techniques to collect and work with data.
Course 4: Relational Database Basics
This course teaches you the fundamental concepts of relational databases and Relational Database Management Systems (RDBMS) such as MySQL, PostgreSQL, and IBM Db2.
Course 5: SQL for Data Science
Learn how to use and apply the powerful language of SQL to better communicate and extract data from databases - a must for anyone working in the data science field.
Course 6: SQL Concepts for Data Engineers
In this short course you will learn additional SQL concepts such as views, stored procedures, transactions and joins.
Course 7: Linux Commands & Shell Scripting
This mini-course describes shell commands and how to use the advanced features of the Bash shell to automate complicated database tasks. For those not familiar with shell scripting, this course provides an overview of common Linux Shell Commands and shell scripting basics.
Course 8: Relational Database Administration (DBA)
This course helps you develop the foundational skills required to perform the role of a Database Administrator (DBA) including designing, implementing, securing, maintaining, troubleshooting and automating databases such as MySQL, PostgreSQL and Db2.
Course 9: Building ETL and Data Pipelines with Bash, Airflow and Kafka
This course provides you with practical skills to build and manage data pipelines and Extract, Transform, Load (ETL) processes using shell scripts, Airflow and Kafka.
Course 10: Data Warehousing and BI Analytics
This course introduces you to designing, implementing and populating a data warehouse and analyzing its data using SQL & Business Intelligence (BI) tools.
Course 11: NoSQL Database Basics
This course introduces you to the fundamentals of NoSQL, including the four key non-relational database categories. By the end of the course you will have hands-on skills for working with MongoDB, Cassandra and IBM Cloudant NoSQL databases.
Course 12: Big Data, Hadoop, and Spark Basics
This course provides foundational big data practitioner knowledge and analytical skills using popular big data tools, including Hadoop and Spark. Learn and practice your big data skills hands-on.
Course 13: Apache Spark for Data Engineering and Machine Learning
This short course introduces you to the fundamentals of Data Engineering and Machine Learning with Apache Spark, including Spark Structured Streaming, ETL for Machine Learning (ML) Pipelines, and Spark ML. By the end of the course, you will have hands-on experience applying Spark skills to ETL and ML workflows.
Course 14: Data Engineering Capstone Project
This Capstone Project is designed for you to apply and demonstrate your Data Engineering skills and knowledge in SQL, NoSQL, RDBMS, Bash, Python, ETL, Data Warehousing, BI tools and Big Data.
Courses
-
Please Note: Learners who successfully complete this IBM course can earn a skill badge, a detailed, verifiable digital credential that profiles the knowledge and skills you’ve acquired in this course. Enroll to learn more, complete the course, and claim your badge!
Kickstart your learning of Python for data science, and programming in general, with this introduction to Python course. This beginner-friendly course will quickly take you from zero to programming in Python in a matter of hours and give you a taste of how to start working with data in Python.
Upon its completion, you'll be able to write your own Python scripts and perform basic hands-on data analysis using our Jupyter-based lab environment. If you want to learn Python from scratch, this course is for you.
You can start creating your own data science projects and collaborating with other data scientists using IBM Watson Studio. When you sign up, you will receive free access to Watson Studio. Start now to take advantage of this platform and learn the basics of programming, machine learning, and data visualization.
-
Much of the world's data lives in databases. SQL (or Structured Query Language) is a powerful programming language that is used for communicating with and extracting various data types from databases. A working knowledge of databases and SQL is necessary to advance as a data scientist or a machine learning specialist. The purpose of this course is to introduce relational database concepts and help you learn and apply foundational knowledge of the SQL language. It is also intended to get you started with performing SQL access in a data science environment.
The emphasis in this course is on hands-on, practical learning. As such, you will work with real databases, real data science tools, and real-world datasets. You will create a database instance in the cloud. Through a series of hands-on labs, you will practice building and running SQL queries. You will also learn how to access databases from Jupyter notebooks using SQL and Python.
No prior knowledge of databases, SQL, Python, or programming is required.
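The labs described above have you build and run SQL queries and access databases from Python. As a minimal sketch of that workflow, the snippet below uses Python's built-in sqlite3 module in place of the cloud database instance the course provides; the table and data are made up for illustration.

```python
import sqlite3

# The course has you create a database instance in the cloud; sqlite3
# (bundled with Python) stands in here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a small table and load a few rows (hypothetical data).
cur.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Data", 90000), ("Ben", "Data", 85000), ("Cy", "Ops", 70000)],
)

# A typical aggregate query: average salary per department.
cur.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
)
print(cur.fetchall())  # [('Data', 87500.0), ('Ops', 70000.0)]
conn.close()
```

The same pattern, with a different driver and connection string, applies to the MySQL, PostgreSQL, and Db2 databases used in the program.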
-
This course will provide you with technical hands-on knowledge of NoSQL databases and Database-as-a-Service (DaaS) offerings. With the advent of Big Data and agile development methodologies, NoSQL databases have gained a lot of relevance in the database landscape. Their main advantage is the ability to effectively handle scalability and flexibility issues raised by modern applications.
You will start by learning the history and the basics of NoSQL databases and discover their key characteristics and benefits. You will learn about the four categories of NoSQL databases and how they differ from each other.
You will explore the architecture and features of several different implementations of NoSQL databases, namely MongoDB, Cassandra, and IBM Cloudant.
Throughout the course you will get practical experience using these NoSQL databases to perform standard database management tasks, such as creating and replicating databases, loading and querying data, modifying database permissions, indexing and aggregating data, and sharding (or partitioning) data.
The course ends with a hands-on project to test your understanding of some of the basics of working with several NoSQL database offerings.
-
Organizations need skilled, forward-thinking Big Data practitioners who can apply their business and technical skills to unstructured data, such as tweets, posts, pictures, audio files, videos, sensor data, satellite imagery, and more, to identify the behaviors and preferences of prospects, clients, competitors, and others.
This course introduces you to Big Data concepts and practices. You will understand the characteristics, features, benefits, and limitations of Big Data and explore some of the Big Data processing tools. You'll explore how Hadoop, Hive, and Spark can help organizations overcome Big Data challenges and reap the rewards of its acquisition.
Hadoop, an open-source framework, enables distributed processing of large data sets across clusters of computers using simple programming models. Each computer, or node, offers local computation and storage, allowing datasets to be processed faster and more efficiently. Hive, a data warehouse software, provides an SQL-like interface to efficiently query and manipulate large data sets in various databases and file systems that integrate with Hadoop.
Open-source Apache Spark is a processing engine built around speed, ease of use, and analytics that provides users with newer ways to store and use big data.
You will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the different components that make up Apache Spark. In this course, you will also learn how Resilient Distributed Datasets, known as RDDs, enable parallel processing across the nodes of a Spark cluster.
You'll gain practical skills when you learn how to analyze data in Spark using PySpark and Spark SQL and how to create a streaming analytics application using Spark Streaming, and more.
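The RDD idea described above is essentially map/reduce over partitioned data. As a plain-Python analogy (not Spark itself, which distributes the partitions across cluster nodes), the sketch below splits a tiny word-count job into partitions, maps over each independently, and merges the partial results the way `reduceByKey` would; the data and partitioning are invented for illustration.

```python
from functools import reduce

# Hypothetical input records (lines of text).
data = ["spark is fast", "spark is flexible", "hadoop scales out"]

# 1. "Parallelize": split the records into partitions. Spark would place
#    these on different nodes; here they sit in one list.
partitions = [data[0:2], data[2:3]]

# 2. Map: compute a per-partition word count; each call is independent,
#    which is what makes the real thing parallelizable.
def count_words(partition):
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

mapped = [count_words(p) for p in partitions]

# 3. Reduce: merge the partial counts into the final result.
def merge(a, b):
    out = dict(a)
    for word, n in b.items():
        out[word] = out.get(word, 0) + n
    return out

totals = reduce(merge, mapped)
print(totals["spark"])  # 2
```

In PySpark the equivalent pipeline would be expressed with `sc.parallelize`, `flatMap`, `map`, and `reduceByKey`, with the partitions actually processed in parallel.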
-
Apache® Spark™ is a fast, flexible, and developer-friendly open-source platform for large-scale SQL, batch processing, stream processing, and machine learning. Users can take advantage of its open-source ecosystem, speed, ease of use, and analytic capabilities to work with Big Data in new ways.
In this short course, you explore concepts and gain hands-on skills to use Spark for data engineering and machine learning applications. You'll learn about Spark Structured Streaming, including data sources, output modes, and operations. Then, explore how graph theory works and discover how GraphFrames supports Spark DataFrames and popular algorithms.
Organizations can acquire data from structured and unstructured sources and deliver the data to users in formats they can use. Learn how to use Spark to extract, transform, and load (ETL) data. Then, you'll hone your newly acquired skills during your "ETL for Machine Learning Pipelines" lab.
Next, discover why machine learning practitioners prefer Spark. You'll learn how to create pipelines and quickly implement feature extraction, selection, and transformation on structured data sets. Discover how to perform classification and regression using Spark. You'll be able to define and identify both supervised and unsupervised learning. Learn about clustering and how to apply the k-means clustering algorithm using Spark MLlib. You'll reinforce your knowledge with focused, hands-on labs and a final project where you will apply Spark to a real-world inspired problem.
Prior to taking this course, please ensure you have foundational Spark knowledge and skills, for example, by first completing the IBM course titled "Big Data, Hadoop and Spark Basics."
-
This mini-course provides a practical introduction to commonly used Linux / UNIX shell commands and teaches you basics of Bash shell scripting to automate a variety of tasks. The course includes both video-based lectures as well as hands-on labs to practice and apply what you learn. You will have no-charge access to a virtual Linux server that you can access through your web browser, so you don't need to download and install anything to perform the labs.
In this course you will work with general purpose commands, like id, date, uname, ps, top, echo, man; directory management commands such as pwd, cd, mkdir, rmdir, find, df; file management commands like cat, wget, more, head, tail, cp, mv, touch, tar, zip, unzip; the access control command chmod; text processing commands wc, grep, tr; as well as networking commands hostname, ping, ifconfig and curl. You will create simple to intermediate shell scripts that involve metacharacters, quoting, variables, command substitution, I/O redirection, pipes and filters, and command-line arguments. You will also schedule cron jobs using crontab.
This course provides essential hands-on skills for data engineers, data scientists, software developers, and cloud practitioners who want to get familiar with frequently used commands on Linux, MacOS and other Unix-like operating systems as well as get started with creating shell scripts.
-
Well-designed and automated data pipelines and ETL processes are the foundation of a successful Business Intelligence platform. Defining your data workflows, pipelines and processes early in the platform design ensures the right raw data is collected, transformed and loaded into desired storage layers and available for processing and analysis as and when required.
This course is designed to provide you with the critical knowledge and skills needed by Data Engineers and Data Warehousing specialists to create and manage ETL, ELT, and data pipeline processes.
Upon completing this course, you'll have a solid understanding of Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) processes; practice extracting, transforming, and loading data into a staging area; create an ETL data pipeline using Bash shell scripting; build a batch ETL workflow using Apache Airflow; and build a streaming data pipeline using Apache Kafka.
You’ll gain hands-on experience with practice labs throughout the course and work on a real-world inspired project to build data pipelines using several technologies that can be added to your portfolio and demonstrate your ability to perform as a Data Engineer.
This course requires prior experience working with datasets, SQL, relational databases, and Bash shell scripts.
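The extract/transform/load steps described above can be sketched in a few lines. The course builds its pipelines with Bash, Airflow, and Kafka; this stdlib-only Python sketch, with made-up weather data, shows only the three ETL stages themselves: parse raw input, convert values, and write to a staging table.

```python
import csv
import io
import sqlite3

# Hypothetical raw input, as it might arrive from a file or API.
raw = "city,temp_f\nToronto,68\nAustin,95\nOslo,41\n"

# Extract: parse the raw CSV into records.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: convert Fahrenheit to Celsius, rounded to one decimal.
transformed = [
    (r["city"], round((float(r["temp_f"]) - 32) * 5 / 9, 1)) for r in rows
]

# Load: write the transformed rows into a staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_weather (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO staging_weather VALUES (?, ?)", transformed)

print(conn.execute("SELECT * FROM staging_weather ORDER BY city").fetchall())
# [('Austin', 35.0), ('Oslo', 5.0), ('Toronto', 20.0)]
conn.close()
```

In a production pipeline, a scheduler such as Airflow would run each stage as a task and Kafka would feed the extract stage with streaming records; the shape of the work stays the same.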
-
In this Capstone you’ll demonstrate your ability to perform as a Data Engineer. Your mission is to design, implement, and manage a complete data and analytics platform consisting of relational and non-relational databases, data warehouses, data pipelines, big data processing engines, and Business Intelligence (BI) tools.
This Capstone project will require that you apply and sharpen the skills and knowledge you developed in the various courses in the IBM Data Engineering Professional Certificate and utilize multiple tools and technologies to design databases, collect data from multiple sources, extract, transform and load data into a data warehouse, and utilize a cloud-based BI tool to create analytic reports and visualizations. You will also implement predictive analytics and machine learning models using big data tools and techniques.
This capstone requires a significant amount of hands-on lab effort throughout the course. You’ll exhibit your knowledge and proficiency working with Python, Bash scripts, SQL, NoSQL, RDBMSes, ETL, MySQL, PostgreSQL, Db2, MongoDB, Apache Airflow, Apache Spark, and Cognos Analytics.
Upon successfully completing this Capstone, you should have the confidence and portfolio to take on real-world data engineering projects and showcase your abilities to perform as an entry-level data engineer.
-
Today’s businesses are investing significantly in capabilities to harness the massive amounts of data that fuel Business Intelligence (BI). Working knowledge of data warehouses and BI analytics tools is a crucial skill for Data Engineers, Data Warehousing Specialists, and BI Analysts, who are among the most valued resources in organizations.
This course prepares you with the skills and hands-on experience to design, implement, and maintain enterprise data warehouse systems and business intelligence tools. You’ll gain extensive knowledge of various data repositories, including data marts, data lakes, and data reservoirs; explore data warehousing system architectures; and dig deeper into data cubes and organizing data using related tables. You'll also analyze data using business intelligence tools like Cognos Analytics, including its reporting and dashboard features and interactive visualization capabilities.
This course provides hands-on experience with practice labs and a real-world inspired project that can be added to your portfolio and will demonstrate your proficiency in working with data warehouses. Skills you will gain include building data warehouses, Star/Snowflake schemas, CUBEs, ROLLUPs, Materialized Views/MQTs, reports and dashboards.
This course assumes prior SQL and relational database experience.
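The star schemas and warehouse queries mentioned above follow a simple pattern: a fact table joined to dimension tables, then aggregated. The sketch below builds a toy star schema with invented sales data, using Python's built-in sqlite3 in place of the full RDBMS and BI stack the course uses.

```python
import sqlite3

# A toy star schema (hypothetical data): one fact table keyed to a
# date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales (date_id INTEGER REFERENCES dim_date(date_id), amount REAL);
INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 150.0), (2, 200.0);
""")

# A typical warehouse query: join the fact table to a dimension and
# aggregate. (On engines that support them, ROLLUP/CUBE would add
# subtotal and grand-total rows to this result.)
query = """
SELECT d.year, d.quarter, SUM(f.amount)
FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.quarter
ORDER BY d.quarter
"""
print(conn.execute(query).fetchall())
# [(2023, 'Q1', 250.0), (2023, 'Q2', 200.0)]
conn.close()
```

A snowflake schema would simply normalize the dimension tables further, adding more joins to the same basic query shape.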
-
Managing databases is a critical skill for Data Engineers and Database Administrators to ensure data is reliable, protected and easily accessible for organizations to make better decisions, solve problems and create business value.
With the amount of data continually expanding and business leaders focused on building data-literate organizations, it’s no surprise that Database Administrators are in high demand and earn a median salary of US $98,860 per year according to the US Bureau of Labor Statistics.
This course provides you with the knowledge and hands-on experience to manage and maintain databases, understand database security, design and define database schemas, tables, views, and other database objects, describe storage, perform backups and recovery, troubleshoot errors, monitor and optimize performance and automate tasks.
This course includes hands-on practice labs and a real-world inspired project to add to your portfolio that will demonstrate your ability to perform database administration tasks using relational databases (RDBMSes) such as MySQL, PostgreSQL, and IBM Db2.
Prior knowledge of database fundamentals and SQL is required to complete this course.
Taught by
Aije Egwaikhide, Karthik Muthuraman, Romeo Kienzler, Rav Ahuja, Jeff Grossman, Steve Ryan, Ramesh Sannareddy, Joseph Santarcangelo, Lin Joyner, Rose Malcolm and Yan Luo
Related Courses
Design Computing: 3D Modeling in Rhinoceros with Python/Rhinoscript (University of Michigan via Coursera)
A Practical Introduction to Test-Driven Development (LearnQuest via Coursera)
FinTech for Finance and Business Leaders (ACCA via edX)
Access Bioinformatics Databases with Biopython (Coursera Project Network via Coursera)
Accounting Data Analytics (University of Illinois at Urbana-Champaign via Coursera)