Data Ingestion with Python
Offered By: LinkedIn Learning
Course Description
Overview
Learn how to use Python tools and techniques to solve one of the main challenges data scientists face: getting good data to train their algorithms.
Syllabus
Introduction
- Why is data ingestion important?
- What you should know
- Using the exercise files
- Using the CoderPad quizzes
- Overview of data scientists' work
- Where does data come from?
- Different types of data
- The data pipeline (ETL)
- Final destination (data lake)
- Working in CSV
- Working in XML
- Working in Parquet, Avro, and ORC
- Unstructured text
- JSON
- Solution: CSV to JSON
- Working with JSON
- Making HTTP calls
- Processing event-based data
- Solution: Location from IP
- Try to find an API
- Working with Beautiful Soup
- Working with Scrapy
- Working with Selenium
- Other considerations
- Solution: Get stock information from HTML
- What are schemas?
- Working with ontologies
- What should be in a schema
- Schema changes
- Schema validations
- Types of databases
- Hosted and cost of ops
- Working with relational databases
- Working with key-value databases
- Working with document databases
- Working with graph databases
- Solution: ETL
- Data is never 100% okay
- Causes of errors
- Filling missing values
- Finding outliers (manual)
- Finding outliers (ML)
- Solution: Clean rides dataset
- Design your data
- KPIs
- What to monitor?
- Next steps
Taught by
Miki Tebeka
Related Courses
- Web Development (Udacity)
- Do-It-Yourself Geo Apps (Esri via Independent)
- Software Construction: Object-Oriented Design (The University of British Columbia via edX)
- Full-Text Search with SAP HANA Platform (SAP Learning)
- Tools for Data Science (IBM via Coursera)