Preparing Data for Machine Learning
This course covers important techniques in data preparation, data cleaning, and feature selection that are needed to set your machine learning model up for success. You will also learn how to use imputation to deal with missing data, along with strategies for identifying and coping with outliers.
What you'll learn
As machine learning explodes in popularity, it is becoming ever more important to know precisely how to prepare the data going into the model in a manner appropriate to the problem you are trying to solve.
In this course, Preparing Data for Machine Learning, you will gain the ability to explore, clean, and structure your data in ways that get the best out of your machine learning model.
First, you will learn why data cleaning and data preparation are so important, and how missing data, outliers, and other data-related problems can be solved. Next, you will discover how models that read too much into data suffer from a problem called overfitting, in which models perform well under test conditions but struggle in live deployments. You will also understand how models that are trained with insufficient or unrepresentative data suffer from a different set of problems, and how these problems can be mitigated.
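As a quick illustration of that overfitting problem (this sketch is not course material; the dataset and model here are hypothetical), an unconstrained decision tree can memorize its training data perfectly yet score worse on held-out data, while a depth-limited tree trades a little training accuracy for better generalization:

```python
# Illustrative sketch: overfitting shows up as a gap between
# training accuracy and held-out (test) accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree fits the training set perfectly (overfits)
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# max_depth=3 constrains the model, reducing the train/test gap
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep:   train", deep.score(X_train, y_train),
      "test", deep.score(X_test, y_test))
print("shallow: train", shallow.score(X_train, y_train),
      "test", shallow.score(X_test, y_test))
```

The deep tree's perfect training score alongside a lower test score is the signature of a model that has "read too much into" its training data.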
Finally, you will round out your knowledge by applying different methods for feature selection, dealing with missing data using imputation, and building your models using the most relevant features.
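The imputation and feature-selection steps above can be sketched in scikit-learn (a minimal, hypothetical example; the data and the choice of `SimpleImputer` with `SelectKBest` are illustrative, not the course's exact workflow):

```python
# Illustrative sketch: fill missing values with the column mean,
# then keep only the k features most relevant to the target.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0],
              [1.5, 2.5, 3.5]])
y = np.array([0, 1, 1, 0])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # replace NaN with column mean
    ("select", SelectKBest(f_classif, k=2)),     # keep 2 most relevant features
])
X_prepared = pipe.fit_transform(X, y)
print(X_prepared.shape)  # 4 rows remain, 2 features selected
```

Wrapping both steps in a `Pipeline` ensures the same imputation and selection are applied consistently at training and prediction time.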
When you’re finished with this course, you will have the skills and knowledge to identify the right data procedures for data cleaning and data preparation to set your model up for success.
Table of contents
- Version Check 0m
- Module Overview 1m
- Prerequisites and Course Outline 2m
- The Need for Data Preparation 4m
- Insufficient Data 6m
- Too Much Data 4m
- Non-representative Data, Missing Values, Outliers, Duplicates 2m
- Dealing with Missing Data 5m
- Dealing with Outliers 6m
- Oversampling and Undersampling to Balance Datasets 4m
- Overfitting and Underfitting 3m
- Module Summary 2m
- Module Overview 1m
- Handling Missing Values 7m
- Cleaning Data 8m
- Visualizing Relationships 4m
- Building a Regression Model 8m
- Univariate Feature Imputation Using the Simple Imputer 7m
- Multivariate Feature Imputation Using the Iterative Imputer 6m
- Missing Value Indicator 2m
- Feature Imputation as Part of a Machine Learning Pipeline 4m
- Module Summary 1m
- Module Overview 2m
- Numeric Data 6m
- Scaling and Standardizing Features 4m
- Normalizing and Binarizing Features 6m
- Categorical Data 3m
- Numeric Encoding of Categorical Data 5m
- Label Encoding and One-hot Encoding 8m
- Discretization of Continuous Values Using Pandas Cut 3m
- Discretization of Continuous Values Using the KBins Discretizer 4m
- Building a Regression Model with Discretized Data 3m
- Module Summary 1m
- Module Overview 1m
- Feature Correlations 8m
- Using the Correlation Matrix to Detect Multicollinearity 5m
- Using Variance Inflation Factor to Detect Multicollinearity 3m
- Feature Selection Using Missing Values Threshold and Variance Threshold 6m
- Univariate Feature Selection Using Chi2 and ANOVA 7m
- Feature Selection Using Wrapper Methods 8m
- Feature Selection Using Embedded Methods 4m
- Module Summary 1m