Cleaning Data with PySpark
Offered By: DataCamp
Course Description
Overview
Learn how to clean data with Apache Spark in Python.
Working with data is tricky - working with millions or even billions of rows is worse.
Did you receive some data processing code written on a laptop with fairly pristine data?
Chances are you’ve been put in charge of moving a basic data process from prototype to production.
You may have worked with real-world datasets, with missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course helps you learn what’s needed to prepare data processes using Python with Apache Spark.
You’ll learn terminology, methods, and some best practices to create a performant, maintainable, and understandable data processing platform.
Syllabus
- DataFrame details
  - A review of DataFrame fundamentals and the importance of data cleaning.
- Manipulating DataFrames in the real world
  - A look at various techniques to modify the contents of DataFrames in Spark.
- Improving performance
  - Improve data cleaning tasks by increasing performance or reducing resource requirements.
- Complex processing and data pipelines
  - Learn how to process complex real-world data using Spark and the basics of pipelines.
Taught by
Mike Metzger
Related Courses
- Python aplicado a la Ciencia de Datos (Universidad Anáhuac via edX)
- Analisis Data dengan Pemrograman R (Google via Coursera)
- Análisis de datos con programación en R (Google via Coursera)
- Análisis de datos con Python (IBM via Coursera)
- Análisis de Datos de Google (Google via Coursera)