Data Science Foundations: Data Assessment for Predictive Modeling
Offered By: LinkedIn Learning
Course Description
Overview
Explore the data understanding phase of the CRISP-DM methodology for predictive modeling. Find out how to collect, describe, explore, and verify data.
Syllabus
Introduction
- Why data assessment is critical
- A note about the exercise files
- Clarifying how data understanding differs from data visualization
- Introducing the critical data understanding phase of CRISP-DM
- Data assessment in CRISP-DM alternatives: The IBM ASUM-DM and Microsoft TDSP
- Navigating the transition from business understanding to data understanding
- How to organize your work with the four data understanding tasks
- Considerations in gathering the relevant data
- A strategy for processing data sources
- Getting creative about data sources
- How to envision a proper flat file
- Anticipating data integration
- Reviewing basic concepts in the level of measurement
- What is dummy coding? (see the R sketch after the syllabus)
- Expanding our definition of level of measurement
- Taking an initial look at possible key variables
- Dealing with duplicate IDs and transactional data
- How many potential variables (columns) will I have?
- How to deal with high-order multiple nominals
- Challenge: Identifying the level of measurement
- Solution: Identifying the level of measurement
- Introducing the KNIME Analytics Platform
- Tips and tricks to consider during data loading
- Unit of analysis decisions
- Challenge: What should the row be?
- Solution: What should the row be?
- How to uncover the gross properties of the data
- Researching the dataset
- Tips and tricks using simple aggregation commands (see the R sketch after the syllabus)
- A simple strategy for organizing your work
- Describe data demo using the UCI heart dataset
- Challenge: Practice describe data with the UCI heart dataset
- Solution: Practice describe data with the UCI heart dataset
- The explore data task
- How to be effective in doing univariate analysis and data visualization
- Anscombe's quartet (see the R sketch after the syllabus)
- The Data Explorer node feature in KNIME
- How to navigate borderline cases of variable type
- How to be effective in doing bivariate data visualization
- Challenge: Producing bivariate visualizations for case study 1
- Solution: Producing bivariate visualizations for case study 1
- How to utilize an SME's time effectively
- Techniques for working with the top predictors
- Advice for weak predictors
- Tips and tricks when searching for quirks in your data
- Learning when to discard rows
- Introducing ggplot2
- Orienting to R's ggplot2 for powerful multivariate data visualizations (see the R sketch after the syllabus)
- Challenge: Producing multivariate visualizations for case study 1
- Solution: Producing multivariate visualizations for case study 1
- Exploring your missing data options (see the R sketch after the syllabus)
- Why you lose rows to listwise deletion
- Investigating the provenance of the missing data
- Introducing the KDD Cup 1998 data
- What is the pattern of missing data in your data?
- Is the missing data worth saving?
- Assessing imputation as a potential solution
- Exploring and verifying data quality with the UCI heart dataset
- Challenge: Quantifying missing data with the UCI heart dataset
- Solution: Quantifying missing data with the UCI heart dataset
- Why formal reports are important
- Creating a data prep to-do list
- How to prepare for eventual deployment
- Next steps
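Example Code Sketches

A minimal base-R sketch of dummy coding as the syllabus introduces it: a nominal variable is expanded into 0/1 indicator columns, with one level held out as the reference category. The `patients` data frame and its column names are hypothetical.

```r
# Hypothetical data: one nominal variable with three categories
patients <- data.frame(
  id = 1:4,
  chest_pain = factor(c("typical", "atypical", "none", "typical"))
)

# model.matrix() expands the factor into 0/1 indicator (dummy) columns;
# the first level (here "atypical") is dropped as the reference category
dummies <- model.matrix(~ chest_pain, data = patients)
print(dummies)
```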
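A short base-R sketch of simple aggregation commands applied to a hypothetical transactional table with duplicate customer IDs, the situation the duplicate-ID and unit of analysis lessons discuss.

```r
# Hypothetical transactional data: several rows per customer
txns <- data.frame(
  customer_id = c(1, 1, 2, 2, 2, 3),
  amount      = c(20, 35, 10, 5, 60, 15)
)

# How many rows does each ID contribute? (spotting duplicate IDs)
table(txns$customer_id)

# Collapse to one row per customer, a common unit of analysis decision
aggregate(amount ~ customer_id, data = txns, FUN = sum)

# Quick look at the gross properties of the data
summary(txns)
```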
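Anscombe's quartet ships with base R as the `anscombe` data set, so the lesson's point can be reproduced directly: four x/y pairs with nearly identical summary statistics that look completely different once plotted.

```r
data(anscombe)

# All four pairs share almost identical means and correlations...
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})

# ...yet plotting reveals four very different relationships
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```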
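A minimal ggplot2 sketch of the kind of multivariate visualization the ggplot2 lessons cover, mapping a third variable to color and a fourth to facets. The built-in `mtcars` data stands in for the course's case study data.

```r
library(ggplot2)

# Scatter plot with a third variable on color and a fourth on facets
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  facet_wrap(~ am, labeller = label_both) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")
```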
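A base-R sketch of the missing-data tasks named in the syllabus: quantifying missingness, seeing how listwise deletion loses rows, and a deliberately naive mean imputation. The `df` data frame is hypothetical, and a real project should first investigate the provenance and pattern of the missing values.

```r
# Hypothetical data frame with missing values
df <- data.frame(
  age  = c(29, 41, NA, 35, 52),
  chol = c(210, NA, NA, 190, 240)
)

# Quantify missing data per column and per row
colSums(is.na(df))
rowSums(is.na(df))

# Listwise deletion: any row with at least one NA is dropped
complete <- na.omit(df)
nrow(df) - nrow(complete)   # number of rows lost

# Naive mean imputation, for illustration only
df$chol[is.na(df$chol)] <- mean(df$chol, na.rm = TRUE)
```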
Taught by
Keith McCormick
Related Courses
- Big Data Analytics in Healthcare (Georgia Institute of Technology via Udacity)
- Model Building and Validation (AT&T via Udacity)
- Maths for Humans: Linear, Quadratic & Inverse Relations (University of New South Wales via FutureLearn)
- Regression Modeling in Practice (Wesleyan University via Coursera)
- Data Science at Scale - Capstone Project (University of Washington via Coursera)