Data Science Foundations: Data Assessment for Predictive Modeling
Offered By: LinkedIn Learning
Course Description
Overview
Explore the data understanding phase of the CRISP-DM methodology for predictive modeling. Find out how to collect, describe, explore, and verify data.
Syllabus
Introduction
- Why data assessment is critical
- A note about the exercise files
- Clarifying how data understanding differs from data visualization
- Introducing the critical data understanding phase of CRISP-DM
- Data assessment in CRISP-DM alternatives: The IBM ASUM-DM and Microsoft TDSP
- Navigating the transition from business understanding to data understanding
- How to organize your work with the four data understanding tasks
- Considerations in gathering the relevant data
- A strategy for processing data sources
- Getting creative about data sources
- How to envision a proper flat file
- Anticipating data integration
- Reviewing basic concepts in the level of measurement
- What is dummy coding?
- Expanding our definition of level of measurement
- Taking an initial look at possible key variables
- Dealing with duplicate IDs and transactional data
- How many potential variables (columns) will I have?
- How to deal with high-order multiple nominals
- Challenge: Identifying the level of measurement
- Solution: Identifying the level of measurement
- Introducing the KNIME Analytics Platform
- Tips and tricks to consider during data loading
- Unit of analysis decisions
- Challenge: What should the row be?
- Solution: What should the row be?
- How to uncover the gross properties of the data
- Researching the dataset
- Tips and tricks using simple aggregation commands
- A simple strategy for organizing your work
- Describe data demo using the UCI heart dataset
- Challenge: Practice describe data with the UCI heart dataset
- Solution: Practice describe data with the UCI heart dataset
- The explore data task
- How to be effective in doing univariate analysis and data visualization
- Anscombe's quartet
- The Data Explorer node feature in KNIME
- How to navigate borderline cases of variable type
- How to be effective in doing bivariate data visualization
- Challenge: Producing bivariate visualizations for case study 1
- Solution: Producing bivariate visualizations for case study 1
- How to utilize an SME's time effectively
- Techniques for working with the top predictors
- Advice for weak predictors
- Tips and tricks when searching for quirks in your data
- Learning when to discard rows
- Introducing ggplot2
- Orienting to R's ggplot2 for powerful multivariate data visualizations
- Challenge: Producing multivariate visualizations for case study 1
- Solution: Producing multivariate visualizations for case study 1
- Exploring your missing data options
- Why you lose rows to listwise deletion
- Investigating the provenance of the missing data
- Introducing the KDD Cup 1998 data
- What is the pattern of missing data in your data?
- Is the missing data worth saving?
- Assessing imputation as a potential solution
- Exploring and verifying data quality with the UCI heart dataset
- Challenge: Quantifying missing data with the UCI heart dataset
- Solution: Quantifying missing data with the UCI heart dataset
- Why formal reports are important
- Creating a data prep to-do list
- How to prepare for eventual deployment
- Next steps
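Example Sketches
The course demonstrates these tasks in the KNIME Analytics Platform and in R's ggplot2. The short R sketches below are illustrative only; they are not taken from the course's exercise files, and all data are either built into base R or invented for the example.

For the describe data task, simple aggregation commands go a long way toward uncovering the gross properties of the data. A minimal base R sketch, using the built-in mtcars data (the course's own demos use the UCI heart dataset instead):

  # Gross properties of a numeric column plus a quick cross-tab of
  # two nominal columns; mtcars ships with base R and stands in for
  # the course's UCI heart dataset.
  summary(mtcars$mpg)                              # min, quartiles, mean, max
  aggregate(mpg ~ cyl, data = mtcars, FUN = mean)  # group means by cylinder count
  table(mtcars$cyl, mtcars$am)                     # counts across two nominal variables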
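On dummy coding: a nominal variable with k categories becomes k-1 binary indicator columns, with the omitted category acting as the reference level. A base R sketch (the region variable is invented for illustration):

  # Dummy coding with base R's model.matrix(): a three-level factor
  # yields two 0/1 indicator columns, and the alphabetically first
  # level ("North") becomes the implicit reference category.
  region <- factor(c("North", "South", "West", "South", "North"))
  model.matrix(~ region)[, -1]   # drop the intercept column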
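Anscombe's quartet, covered in the explore data lessons, ships with base R and shows why univariate summaries alone can mislead: four x-y pairs with nearly identical means, variances, and correlations, but wildly different scatterplots.

  # Near-identical summary statistics across the four pairs;
  # only plotting reveals how different the datasets really are.
  data(anscombe)
  sapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    round(c(mean_y = mean(y), var_y = var(y), cor_xy = cor(x, y)), 2)
  })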
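For the ggplot2 lessons, a minimal multivariate sketch: color encodes a third variable and faceting a fourth, so a single plot shows a four-way relationship. Again, mtcars stands in for the course data.

  library(ggplot2)
  # Weight vs. fuel economy, colored by cylinder count and faceted
  # by transmission type (am): four variables in one view.
  ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point(size = 2) +
    facet_wrap(~ am, labeller = label_both) +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")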
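On why you lose rows to listwise deletion: missingness accumulates across columns, so even modest per-column rates can wipe out most rows. A base R sketch with an invented five-row data frame:

  # age and income are each only 40% missing, yet listwise deletion
  # keeps just one of five rows, because the NAs fall on different rows.
  df <- data.frame(age    = c(34, NA, 51, 42, NA),
                   income = c(NA, 48000, 52000, NA, 61000),
                   region = c("N", "S", "S", "N", "W"))
  colSums(is.na(df))        # missing values per column
  sum(complete.cases(df))   # rows that survive listwise deletion
  nrow(na.omit(df))         # same count, via na.omit()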
Taught by
Keith McCormick
Related Courses
- Lean Data Approaches to Measure Social Impact (Acumen Academy)
- Advanced Manufacturing Process Analysis (University at Buffalo via Coursera)
- Artificial Intelligence Data Fairness and Bias (LearnQuest via Coursera)
- AI in Healthcare Capstone (Stanford University via Coursera)
- Google Data Analytics (PT) (Google via Coursera)