Preparing Data for Modeling with scikit-learn
Jun 24, 2019 • 18 Minute Read
Introduction
Data preparation often takes up to eighty percent of a data scientist's time in a data science project, which underscores its importance in the machine learning life-cycle.
In this guide, you will learn the basics and implementation of several data preparation techniques, mentioned below:
- Dealing with Incorrect Entries
- Missing Value Treatment
- Encoding Categorical Labels
- Handling Outliers
- Logarithmic Transformation
- Standardization
- Converting the Column Types
Data
In this guide, we will be using a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
- Marital_status - Whether the applicant is married ("1") or not ("0").
- Dependents - Number of dependents claimed by the applicant.
- Is_graduate - Whether the applicant is a graduate ("1") or not ("0").
- Income - Annual Income of the applicant (in hundreds of dollars).
- Loan_amount - Loan amount (in hundreds of dollars) for which the application was submitted.
- Term_months - Tenure of the loan (in months).
- Credit_score - Whether the applicant's credit score was good ("1") or not ("0").
- Age - The applicant’s age in years.
- Sex - Whether the applicant is female (F) or male (M).
- approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and modules.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
Reading the Data and Performing Basic Data Checks
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 10 variables. The third line gives the summary statistics of the variables.
# Load data
dat2 = pd.read_csv("data_prep.csv")
print(dat2.shape)
dat2.describe()
Output:
(600, 10)
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
| count | 600.000000 | 598.000000 | 599.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
| mean | 0.651667 | 0.730769 | 2.449082 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 51.766667 |
| std | 0.476840 | 0.997194 | 40.788143 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 21.240704 |
| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 51.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
| max | 1.000000 | 3.000000 | 999.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 200.000000 |
Dealing with Incorrect Entries
The above output shows that the variable 'Age' has minimum and maximum values of 0 and 200, respectively. Also, the variable 'Is_graduate' has a maximum value of 999, instead of the binary values '0' and '1'. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency, i.e., the mean, median, or mode.
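For reference, a minimal sketch of the deletion approach is shown below; it is not applied in this guide, and the name 'dat2_filtered' is used only for illustration.
# Alternative approach (not used here): drop the records with implausible entries
dat2_filtered = dat2[(dat2['Age'] > 0) & (dat2['Age'] < 200) & (dat2['Is_graduate'] != 999)]
print(dat2_filtered.shape)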
Starting with the 'Age' variable, the first two lines of code below replace the incorrect values '0' and '200' with 'NaN', an indicator of missing values. We repeat the same process for the variable 'Is_graduate' in the third line of code. The fourth line prints the information about the variables.
dat2.Age.replace(0, np.nan, inplace=True)
dat2.Age.replace(200, np.nan, inplace=True)
dat2.Is_graduate.replace(999, np.nan, inplace=True)
dat2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null int64
Dependents 598 non-null float64
Is_graduate 598 non-null float64
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null int64
approval_status 600 non-null int64
Age 594 non-null float64
Sex 595 non-null object
dtypes: float64(3), int64(6), object(1)
memory usage: 47.0+ KB
Now the variables 'Age' and 'Is_graduate' have 594 and 598 non-null records, respectively. The replaced entries have been marked as missing, and we will deal with them in the next section.
Missing Value Treatment
There are various techniques for handling missing values. The most widely used is replacing them with a measure of central tendency. The first line of code below replaces the missing values of the 'Age' variable with the mean of the remaining values. The second line replaces the missing values of the 'Is_graduate' variable with the value '1', which indicates that the applicant's education status is 'graduate'. The third line gives the summary statistics of the variables.
dat2['Age'].fillna(dat2['Age'].mean(), inplace=True)
dat2['Is_graduate'].fillna(1,inplace=True)
dat2.describe()
Output:
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
| count | 600.000000 | 598.000000 | 600.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
| mean | 0.651667 | 0.730769 | 0.783333 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 50.606061 |
| std | 0.476840 | 0.997194 | 0.412317 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 16.184651 |
| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 22.000000 |
| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 50.606061 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
| max | 1.000000 | 3.000000 | 1.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 80.000000 |
The corrections have now been made in both of the variables. The data also has a variable, 'Sex', with five missing values. Since this is a categorical variable, we will check the distribution of labels, which is done in the line of code below.
dat2['Sex'].value_counts()
Output:
M 484
F 111
Name: Sex, dtype: int64
The output shows that 484 out of 595 applicants are male, so we will replace the missing values with the label 'M'. The first line of code below performs this task, while the second line prints the distribution of the variable. The label counts in the output now sum to 600, which means the missing values have been accounted for.
dat2['Sex'].fillna('M',inplace=True)
dat2['Sex'].value_counts()
Output:
M 489
F 111
Name: Sex, dtype: int64
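Note that instead of hard-coding 'M', the most frequent label can also be looked up programmatically. The sketch below is equivalent (and a no-op at this point, since the missing values have already been filled):
# Alternative: fill with the most frequent label (the mode) instead of hard-coding 'M'
most_frequent = dat2['Sex'].mode()[0]   # 'M' in this data
dat2['Sex'] = dat2['Sex'].fillna(most_frequent)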
We will now check if any more variables have missing values, which is done in the line of code below. The output shows that we still have two missing values in the variable 'Dependents'.
dat2.isnull().sum()
Output:
Marital_status 0
Dependents 2
Is_graduate 0
Income 0
Loan_amount 0
Term_months 0
Credit_score 0
approval_status 0
Age 0
Sex 0
dtype: int64
Since there are only two missing values left in the dataset, we will use another approach and simply drop the records containing them. The first line of code below uses the 'dropna()' function to drop rows with any missing values in them, while the second line checks the information about the dataset.
dat2 = dat2.dropna()
dat2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 598 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 598 non-null int64
Dependents 598 non-null float64
Is_graduate 598 non-null float64
Income 598 non-null int64
Loan_amount 598 non-null int64
Term_months 598 non-null int64
Credit_score 598 non-null int64
approval_status 598 non-null int64
Age 598 non-null float64
Sex 598 non-null object
dtypes: float64(3), int64(6), object(1)
memory usage: 51.4+ KB
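As an aside, the same kind of imputation can also be expressed with scikit-learn's 'SimpleImputer' (available in the 'sklearn.impute' module in recent versions), which is convenient when the same preprocessing has to be reapplied to new data. A minimal sketch, assuming it is run before the missing values are filled:
# Mean imputation for a numeric column, expressed with scikit-learn instead of pandas;
# strategy='most_frequent' would serve categorical columns such as 'Sex'
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
age_imputed = imputer.fit_transform(dat2[['Age']])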
Encoding Categorical Labels
The missing values have been treated, but the labels in the variable 'Sex' use letters ('M' and 'F'). For modeling with scikit-learn, all the variables should be numeric, so we have to change the labels. Since there are only two labels, we can use binary encoding, which is done in the first line of code below. The output from the second line shows that we have successfully performed the encoding.
dat2["Sex"] = dat2["Sex"].map({"M": 0, "F":1})
dat2['Sex'].value_counts()
Output:
0 487
1 111
Name: Sex, dtype: int64
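The manual mapping above works well because 'Sex' has only two labels. For variables with more labels, scikit-learn's 'LabelEncoder' or pandas' 'get_dummies' are common alternatives; the short sketch below is shown for illustration only, since 'Sex' has already been encoded:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns an integer to each distinct label ('F' -> 0, 'M' -> 1 here)
le = LabelEncoder()
encoded = le.fit_transform(['M', 'F', 'M', 'F'])
# get_dummies creates one indicator (one-hot) column per label
dummies = pd.get_dummies(pd.Series(['M', 'F', 'M', 'F']), prefix='Sex')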
Handling Outliers
One of the biggest obstacles in predictive modeling is the presence of outliers, extreme values that differ markedly from the other data points. Outliers are a problem because they can mislead the training process and result in inaccurate models.
For numerical variables, we can identify outliers visually through a histogram or numerically through the skewness value. The two lines of code below plot the histogram along with the skewness value for the 'Income' variable.
plot1 = sns.distplot(dat2["Income"], color="b", label="Skewness : %.1f"%(dat2["Income"].skew()))
plot1 = plot1.legend(loc="best")
Output: [Histogram of 'Income' with the skewness value (6.5) shown in the legend]
The histogram shows that the variable 'Income' has a right-skewed distribution with a skewness value of 6.5. Ideally, the skewness value should lie between -1 and 1.
Apart from the variable 'Income', other variables ('Loan_amount' and 'Age') are also on very different scales and will need to be normalized. We will learn a couple of techniques in the subsequent sections to deal with these preprocessing problems.
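Besides transformation and scaling, which are covered next, a common way to treat outliers directly is to cap them at the interquartile range (IQR) boundaries. A minimal sketch of this approach, applied to a copy of the 'Income' variable for illustration only:
# Cap 'Income' at the usual IQR-based boundaries (computed on a copy, not applied to the data)
q1, q3 = dat2['Income'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
income_capped = dat2['Income'].clip(lower=lower, upper=upper)
print(income_capped.skew())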
Logarithmic Transformation of Numerical Variables
The previous chart showed that the variable 'Income' is skewed. One of the ways to make its distribution normal is by logarithmic transformation. The first line of code below creates a new variable, 'LogIncome', while the second and third lines of code plot the histogram and skewness value of this new variable.
dat2["LogIncome"] = dat2["Income"].map(lambda i: np.log(i) if i > 0 else 0)
plot2 = sns.distplot(dat2["LogIncome"], color="m", label="Skewness : %.1f"%(dat2["LogIncome"].skew()))
plot2 = plot2.legend(loc="best")
Output: [Histogram of 'LogIncome' with the skewness value shown in the legend]
The above chart shows that taking the log of the 'Income' variable makes the distribution roughly normal and reduces the skewness. We can use the same transformation for other numerical variables, but, instead, we will learn another transformation technique called Standardization.
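Note that the lambda above maps non-positive incomes to 0 before taking the log. A common alternative is 'np.log1p', which computes log(1 + x) and therefore handles zeros directly; a minimal sketch, computed on a copy without adding a column:
# Alternative: log(1 + x) avoids the special case for non-positive values
log_income_alt = np.log1p(dat2["Income"])
print(log_income_alt.skew())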
Standardization
Several machine learning algorithms use some form of distance metric to learn from the data. However, when features are on different scales, such as 'Age' in years and 'Income' in hundreds of dollars, the features on larger scales can unduly influence the model. As a result, we want all features on a similar scale, which can be achieved through scaling techniques.
One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below imports 'StandardScaler' from the 'sklearn.preprocessing' module. The second line performs the standardization of the three variables 'Income', 'Loan_amount', and 'Age'. Finally, the third line prints the variance of the scaled variables.
from sklearn.preprocessing import StandardScaler
dat2[['Income','Loan_amount', 'Age']] = StandardScaler().fit_transform(dat2[['Income','Loan_amount', 'Age']])
print(dat2['Income'].var()); print(dat2['Loan_amount'].var()); print(dat2['Age'].var())
Output:
1.0016750418760463
1.0016750418760472
1.001675041876044
All the standardized variables now have a variance of approximately one. Let us now look at the variables after all the preprocessing performed so far.
print(dat2.info())
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 598 entries, 0 to 599
Data columns (total 11 columns):
Marital_status 598 non-null int64
Dependents 598 non-null float64
Is_graduate 598 non-null float64
Income 598 non-null float64
Loan_amount 598 non-null float64
Term_months 598 non-null int64
Credit_score 598 non-null int64
approval_status 598 non-null int64
Age 598 non-null float64
Sex 598 non-null int64
LogIncome 598 non-null float64
dtypes: float64(6), int64(5)
memory usage: 56.1 KB
None
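One practical caveat: when the data will later be split into training and test sets, the scaler is usually fitted on the training portion only and then applied to the test portion, so that no information leaks from the test set. A minimal sketch of that pattern, assuming the split is made before scaling (the variable names below are hypothetical):
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Fit the scaler on the training split only, then reuse its statistics on the test split
X_train, X_test = train_test_split(dat2[['Income', 'Loan_amount', 'Age']], test_size=0.3, random_state=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)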
Converting the Column Types
The two variables 'Dependents' and 'Is_graduate' have been read as 'float64', which indicates numeric variables with decimal values. This is not correct, as both of these variables take integer values. For carrying out mathematical operations on the variables during the modeling process, it is important that they have the correct data types.
The first two lines of code below convert these variables to the integer data type, while the third line prints the data types of the variables.
dat2["Dependents"] = dat2["Dependents"].astype("int")
dat2["Is_graduate"] = dat2["Is_graduate"].astype("int")
print(dat2.dtypes)
Output:
Marital_status int64
Dependents int32
Is_graduate int32
Income float64
Loan_amount float64
Term_months int64
Credit_score int64
approval_status int64
Age float64
Sex int64
LogIncome float64
dtype: object
The data types of the variables 'Dependents' and 'Is_graduate' have now been corrected. We created the additional variable 'LogIncome' to demonstrate logarithmic transformation; the same transformation could have been applied to the 'Income' variable directly, without creating a new one.
All the variables are now in the right form, and we could proceed to build models that predict the 'approval_status' of the loan applications. However, model building is not within the scope of this guide; you can learn about it through the other Pluralsight guides on scikit-learn linked at the end.
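For completeness, a minimal sketch of how the modules imported at the beginning of the guide could be used for such a model is shown below; this is only an illustration, not a full modeling workflow:
# Minimal sketch of a modeling step (not covered in this guide)
X = dat2.drop('approval_status', axis=1)
y = dat2['approval_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression()
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))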
Conclusion
In this guide, you have learned the fundamental techniques of data preprocessing for machine learning. You learned how to deal with incorrect entries and missing values, encode categorical labels, identify and treat outliers, transform and standardize numeric variables, and convert data types.
To learn more about building machine learning models using scikit-learn, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
To learn more about building deep learning models using Keras, please refer to the following guides: