Preparing Data for Feature Engineering and Machine Learning
Jul 29, 2019 • 25 Minute Read
Introduction
Great machine learning models are often built on high-quality features, which makes data preparation and feature engineering imperative. In this guide, you will learn how to prepare data and create powerful features for predictive modeling. We will begin by exploring the data.
Data
In this guide, we will be using a fictitious dataset of loan applicants containing 600 observations and 11 variables, as described below:
- Marital_status - Whether the applicant is married ("Yes") or not ("No").
- Dependents - Number of dependents claimed by the applicant.
- Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").
- Income - Annual income of the applicant (in US dollars).
- Loan_amount - Loan amount (in US dollars) for which the application was submitted.
- Term_months - Tenure of the loan (in months).
- Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not _satisfactory").
- Age - The applicant's age in years.
- Sex - Whether the applicant is female ("F") or male ("M").
- Purpose - Purpose of taking the loan as per the applicant (Education, Personal, Travel, Business, etc.).
- approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and the data.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the Data
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 11 variables. The third line prints information about the variables and their types.
df = pd.read_csv("data_eng.csv")
print(df.shape)
print(df.info())
Output:
(600, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 11 columns):
Marital_status 600 non-null object
Dependents 593 non-null float64
Is_graduate 594 non-null object
Income 600 non-null int64
Loan_amount 592 non-null float64
Term_months 593 non-null float64
Credit_score 600 non-null object
approval_status 600 non-null int64
Age 597 non-null float64
Sex 599 non-null object
Purpose 600 non-null object
dtypes: float64(4), int64(2), object(5)
memory usage: 51.6+ KB
None
The above output shows that there are five categorical (object) and six numerical variables - four float and two integer. The output also shows that a few of the variables have fewer than 600 non-null observations, indicating the presence of missing values.
Missing Values
Most machine learning algorithms cannot handle data that has missing values. The line of code below prints the number of missing values in each variable.
df.apply(lambda x: sum(x.isnull()),axis=0)
Output:
Marital_status 0
Dependents 7
Is_graduate 6
Income 0
Loan_amount 8
Term_months 7
Credit_score 0
approval_status 0
Age 3
Sex 1
Purpose 0
dtype: int64
We will treat the missing values as per the variable type.
Missing Values for Continuous Variables
The most widely used method for treating missing values in numeric variables is to replace them with a measure of central tendency. The three lines of code below replace the missing values in the variables 'Loan_amount', 'Term_months', and 'Age', respectively, with the mean of each variable. The fourth line prints the number of missing values in the data. The output shows the corrections that have been made in these three variables.
df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
df.apply(lambda x: sum(x.isnull()),axis=0)
Output:
Marital_status 0
Dependents 7
Is_graduate 6
Income 0
Loan_amount 0
Term_months 0
Credit_score 0
approval_status 0
Age 0
Sex 1
Purpose 0
dtype: int64
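The mean is sensitive to extreme values, so for heavily skewed numeric variables the median is a common alternative. A minimal sketch, shown for illustration only (the missing values above have already been filled with the mean):
# Median imputation -- more robust than the mean when a variable is skewed.
df['Loan_amount'].fillna(df['Loan_amount'].median(), inplace=True)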
Missing Values for Categorical Variables
One of the categorical variables, 'Is_graduate', has six missing values. Since this is a categorical variable, we will replace the missing values with the mode of the distribution. The line of code below displays the frequency count of the variable.
df['Is_graduate'].value_counts()
Output:
Yes 466
No 128
Name: Is_graduate, dtype: int64
The label 'Yes' has the highest frequency (it is the mode) for the variable 'Is_graduate', so we will replace the missing values with the label 'Yes'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the variable, which means the missing values have been replaced.
df['Is_graduate'].fillna('Yes',inplace=True)
df['Is_graduate'].value_counts()
Output:
Yes 472
No 128
Name: Is_graduate, dtype: int64
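Note that instead of hardcoding the label 'Yes', the mode can also be looked up programmatically. An equivalent sketch:
# Fill missing values with the most frequent label (the mode) of the column.
df['Is_graduate'].fillna(df['Is_graduate'].mode()[0], inplace=True)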
Deleting the Missing Records
The missing values of the variable 'Is_graduate' have been replaced. However, the data still has seven missing values in the variable 'Dependents' and one missing value in the variable 'Sex'.
Since the number of missing values is small, we will use another approach: dropping the records that contain them. The first line of code below uses the dropna() function to drop rows with any missing values, while the second line checks the information about the dataset. The output shows that all the variables now have the same number of observations and that the treatment of missing values is complete.
df = df.dropna()
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 592 entries, 0 to 598
Data columns (total 11 columns):
Marital_status 592 non-null object
Dependents 592 non-null float64
Is_graduate 592 non-null object
Income 592 non-null int64
Loan_amount 592 non-null float64
Term_months 592 non-null float64
Credit_score 592 non-null object
approval_status 592 non-null int64
Age 592 non-null float64
Sex 592 non-null object
Purpose 592 non-null object
dtypes: float64(4), int64(2), object(5)
memory usage: 55.5+ KB
The data is free from missing values and we can now work on the feature engineering part. We will start with the categorical variables.
Feature Engineering for the Categorical Variables
Aggregating Categories
When a categorical variable has too many unique categories (or labels), it may limit the predictive power of the machine learning model. In such a scenario, limiting the count of these labels can be a solution. In our data, the variable 'Purpose' has several categories, which can be displayed using the line of code below.
df['Purpose'].value_counts()
Output:
Education 185
Personal 164
Travel 123
Business 43
Wedding 39
Furniture 38
Name: Purpose, dtype: int64
Notice that the categories 'Business', 'Wedding', and 'Furniture' have significantly lower counts than the other three categories. It is likely that these sparse categories will not, individually, have a statistically significant effect on predicting 'approval_status'. Therefore, we will aggregate these three categories into a new category called 'other'.
The code below uses a Python dictionary to re-code the 'Purpose' categories into a smaller set of labels, while the last line prints the resulting frequency table.
purpose_categories = {'Education':'Education', 'Personal':'Personal',
'Travel':'Travel', 'Business':'other',
'Wedding':'other', 'Furniture':'other'}
df['Purpose'] = [purpose_categories[x] for x in df['Purpose']]
df['Purpose'].value_counts()
Output:
Education 185
Personal 164
Travel 123
other 120
Name: Purpose, dtype: int64
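The same grouping can also be derived automatically by thresholding on label frequency instead of listing the categories by hand. A minimal sketch, assuming a 10% share as the cut-off (shown for illustration; on the already re-coded column it changes nothing):
# Group any label whose relative frequency falls below 10% into 'other'.
freq = df['Purpose'].value_counts(normalize=True)
rare_labels = freq[freq < 0.10].index
df.loc[df['Purpose'].isin(rare_labels), 'Purpose'] = 'other'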
Encoding Categorical Variables
For modeling with scikit-learn, all the variables should be numeric, so we will have to convert the labels. There are two common techniques for doing this.
One Hot Encoding
In this technique, the features are encoded using a one-hot ('dummy') scheme, which creates a separate binary column for each category.
The first line of code below performs the one-hot encoding for the 'Sex' variable, while the second line prints the first five observations of the data.
df = pd.get_dummies(df, columns=['Sex'],prefix='S')
df.head(5)
Output:
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Purpose | S_F | S_M |
|--- |---------------- |------------ |------------- |-------- |------------- |------------- |------------------- |----------------- |------ |----------- |----- |----- |
| 0 | Yes | 2.0 | No | 816700 | 104000.0 | 365.079258 | Satisfactory | 0 | 55.0 | Education | 0 | 1 |
| 1 | No | 0.0 | Yes | 722300 | 79500.0 | 365.079258 | Not _satisfactory | 0 | 32.0 | Education | 1 | 0 |
| 2 | No | 0.0 | Yes | 666700 | 84000.0 | 365.079258 | Satisfactory | 0 | 59.0 | Education | 1 | 0 |
| 3 | No | 0.0 | Yes | 500000 | 63000.0 | 365.079258 | Satisfactory | 0 | 36.0 | Education | 0 | 1 |
| 4 | No | 0.0 | Yes | 409200 | 39500.0 | 504.000000 | Satisfactory | 0 | 42.0 | Education | 0 | 1 |
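The same encoding can also be produced with scikit-learn's OneHotEncoder, which returns a sparse matrix that can be converted to a dense array. A minimal sketch, shown only as an alternative to pd.get_dummies (here applied to the 'Purpose' column, which has not been encoded yet):
# One-hot encode 'Purpose' with scikit-learn; fit_transform returns a sparse matrix.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
purpose_encoded = encoder.fit_transform(df[['Purpose']]).toarray()
print(encoder.categories_)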
In the encoded dataframe above, the variable 'Sex' has been replaced by two binary columns, 'S_F' and 'S_M'. However, this duplicates information: the presence of the 'Female' category automatically implies the absence of the 'Male' category. This redundancy can be removed with a technique called dummy encoding.
Dummy Encoding
In this technique, the features are encoded so that there is no duplication of information. This is achieved by passing the argument drop_first=True to pd.get_dummies(), as done in the first four lines of code below. The fifth line drops the redundant column 'S_M', while the sixth line prints the information about the data.
df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df = df.drop(['S_M'], axis=1)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 592 entries, 0 to 598
Data columns (total 13 columns):
Dependents 592 non-null float64
Income 592 non-null int64
Loan_amount 592 non-null float64
Term_months 592 non-null float64
approval_status 592 non-null int64
Age 592 non-null float64
S_F 592 non-null uint8
Pur_Personal 592 non-null uint8
Pur_Travel 592 non-null uint8
Pur_other 592 non-null uint8
M_Yes 592 non-null uint8
Ed_Yes 592 non-null uint8
CSc_Satisfactory 592 non-null uint8
dtypes: float64(4), int64(2), uint8(7)
memory usage: 36.4 KB
Feature Engineering for the Numerical Variables
Feature engineering for numerical variables requires a different strategy than for categorical features. The data has five numerical features - Dependents, Income, Loan_amount, Term_months, and Age. In the subsequent sections, we will learn various techniques for handling numerical variables.
Handling Extreme Values
Modeling numerical features can be difficult in the presence of extreme values, or outliers, which are observations that differ significantly from the other data points. One quick indicator of extreme values in a numerical variable is a large skewness value. The lines of code below print the skewness value along with the summary statistics for the 'Income' variable.
print(df['Income'].skew())
df['Income'].describe()
Output:
5.32574034245
count 5.920000e+02
mean 7.105902e+05
std 7.146531e+05
min 3.000000e+04
25% 3.865250e+05
50% 5.099500e+05
75% 7.713500e+05
max 8.444900e+06
Name: Income, dtype: float64
The output above shows that the variable 'Income' has a right-skewed distribution with a skewness value of 5.3. As a rule of thumb, the skewness value should ideally be between -1 and 1. There are many techniques for handling these extreme values, one of which is quantile-based capping or flooring. Since the variable 'Income' is right-skewed, we will cap the higher values.
The first line of code below prints the 90th percentile of the variable, which comes out to approximately USD 1,247,210. The second line of code performs the capping, while the third line prints the skewness value after capping.
print(df['Income'].quantile(0.90))
df["Income"] = np.where(df["Income"] >1247209, 1247209,df['Income'])
print(df['Income'].skew())
Output:
1247209.9999999998
0.894679717523
The skewness value is now 0.89, which is much better.
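For a left-skewed variable, the analogous fix would be flooring at a lower quantile. A minimal sketch, purely for illustration ('Income' is right-skewed, so we do not actually apply it):
# Flooring sketch: replace values below the 10th percentile with that percentile.
floor_value = df['Income'].quantile(0.10)
income_floored = np.where(df['Income'] < floor_value, floor_value, df['Income'])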
Creating New Features
Creating new features from two or more existing features can lead to an improvement in the predictive power of a machine learning algorithm. These new features are often referred to as interaction terms.
In the loan application data, the loan-to-income ratio can be an important metric for approving loan applications. Another example could be the loan amount relative to the term of the loan. The lines of code below create two new features, 'loan_income_ratio' and 'loan_term_ratio', that capture the interaction between the loan amount, the applicant's income, and the loan term.
### Creating new features
df['loan_income_ratio'] = df['Loan_amount']/df['Income']
df['loan_term_ratio'] = df['Loan_amount']/df['Term_months']
Data Scaling and Transformations
Let us now look at the summary statistics of the variables which can be achieved with the line of code below.
df.describe()
Output:
| | Dependents | Income | Loan_amount | Term_months | approval_status | Age | S_F | Pur_Personal | Pur_Travel | Pur_other | M_Yes | Ed_Yes | CSc_Satisfactory | loan_income_ratio | loan_term_ratio |
|------- |------------ |-------------- |-------------- |------------- |----------------- |------------ |------------ |-------------- |------------ |------------ |------------ |------------ |------------------ |------------------- |----------------- |
| count | 592.000000 | 5.920000e+02 | 5.920000e+02 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 |
| mean | 0.753378 | 6.076007e+05 | 3.311038e+05 | 364.956884 | 0.690878 | 49.677365 | 0.182432 | 0.277027 | 0.207770 | 0.202703 | 0.653716 | 0.787162 | 0.788851 | 0.471830 | 1029.732920 |
| std | 1.029179 | 3.067529e+05 | 7.277030e+05 | 61.877355 | 0.462522 | 14.427056 | 0.386527 | 0.447908 | 0.406055 | 0.402353 | 0.476187 | 0.409660 | 0.408469 | 0.996969 | 2647.188471 |
| min | 0.000000 | 3.000000e+04 | 1.860000e+04 | 36.000000 | 0.000000 | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.086564 | 48.437500 |
| 25% | 0.000000 | 3.865250e+05 | 6.200000e+04 | 384.000000 | 0.000000 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.122285 | 166.341146 |
| 50% | 0.000000 | 5.099500e+05 | 7.800000e+04 | 384.000000 | 1.000000 | 51.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.162164 | 217.604425 |
| 75% | 1.000000 | 7.713500e+05 | 1.415000e+05 | 384.000000 | 1.000000 | 61.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.243491 | 423.023897 |
| max | 6.000000 | 1.247209e+06 | 7.780000e+06 | 504.000000 | 1.000000 | 76.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 10.930828 | 32696.078431 |
We now have seven numerical features - Dependents, Income, Loan_amount, Term_months, Age, loan_income_ratio, and loan_term_ratio. However, they are measured in different units and on different scales, such as 'Age' in years and 'Income' in dollars. These differences can unduly influence a model, so we need to scale or transform the features. There are several techniques for achieving this, and some of the popular ones are discussed in the subsequent sections.
Min-max Scaling
In this approach, the data is scaled to a fixed range - usually 0 to 1. Note that because the observed minimum and maximum define this range, the technique is sensitive to extreme values in the data.
The first line of code below imports MinMaxScaler from the sklearn.preprocessing module. The second line instantiates MinMaxScaler(), while the third line fits the scaler on the 'Age' column. The fourth line transforms the 'Age' column with the fitted scaler, while the fifth line prints the summary statistics of the new 'scaled_age' variable. The output shows that the variable has been scaled to between zero and one.
from sklearn.preprocessing import MinMaxScaler
minmaxscaler = MinMaxScaler()
minmaxscaler.fit(df[['Age']])
df['scaled_age'] = minmaxscaler.transform(df[['Age']])
df['scaled_age'].describe()
Output:
count 592.000000
mean 0.512544
std 0.267168
min 0.000000
25% 0.277778
50% 0.537037
75% 0.722222
max 1.000000
Name: scaled_age, dtype: float64
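For reference, the same result can be computed manually, which makes the min-max formula (x - min) / (max - min) explicit:
# Manual min-max scaling of 'Age' -- equivalent to what MinMaxScaler computes.
age_min, age_max = df['Age'].min(), df['Age'].max()
manual_scaled_age = (df['Age'] - age_min) / (age_max - age_min)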
Standardization
Another technique is standardization, in which all the features are centered around zero and have, roughly, unit variance.
The first line of code below imports StandardScaler from the sklearn.preprocessing module. The second line performs the standardization of the four variables 'Loan_amount', 'Term_months', 'loan_income_ratio', and 'loan_term_ratio'. Finally, the third line prints the variance of the standardized variables.
### Standardization
from sklearn.preprocessing import StandardScaler
df[['Loan_amount', 'Term_months', 'loan_income_ratio', 'loan_term_ratio']] = StandardScaler().fit_transform(df[['Loan_amount', 'Term_months','loan_income_ratio', 'loan_term_ratio']])
print(df['Loan_amount'].var()); print(df['Term_months'].var()); print(df['loan_income_ratio'].var()); print(df['loan_term_ratio'].var())
Output:
1.0016920473773263
1.0016920473773332
1.0016920473773272
1.0016920473773252
The output shows that all the standardized variables have approximately unit variance. (The small deviation from exactly 1 arises because pandas' .var() computes the sample variance, while StandardScaler standardizes using the population variance.)
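For reference, standardization maps each value to (x - mean) / standard deviation. A minimal manual sketch on the (not yet standardized) 'Age' column; StandardScaler divides by the population standard deviation, hence ddof=0:
# Manual z-score standardization of 'Age', matching StandardScaler's computation.
manual_std_age = (df['Age'] - df['Age'].mean()) / df['Age'].std(ddof=0)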
Log Transformation
Earlier in the guide, we did quantile capping of the income variable. We could also have used a logarithmic transformation to treat the extreme values. This is done in the line of code below, which creates a new feature, 'LogIncome'.
df["LogIncome"] = df["Income"].map(lambda i: np.log(i) if i > 0 else 0)
Binning Numeric Variables
Another feature engineering technique is binning, in which a numeric variable is grouped into categories based on the distribution of the data. The line of code below prints the summary statistics of the variable 'Dependents'.
df['Dependents'].describe()
Output:
count 592.000000
mean 0.753378
std 1.029179
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 6.000000
Name: Dependents, dtype: float64
The 'Dependents' variable ranges from zero to six, but the majority of applicants have one dependent or none. We will group this variable into three bins, which is done in the line of code below.
df['Binned_Dependents'] = pd.cut(df['Dependents'], bins=[-np.inf, 0, 2, np.inf], labels=[0, '1-2', '>2'])
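To verify the binning, the distribution of the new variable can be inspected (output omitted here):
# Check how many applicants fall into each dependents bin.
print(df['Binned_Dependents'].value_counts())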
We have done a lot of data preparation and feature engineering. As a result, certain variables are now redundant and should be dropped from the data. The first line of code below drops three such variables, while the second line prints the shape of the new data. The third line displays the first five observations of the data, which has 592 observations and 15 variables.
df_new = df.drop(['Dependents', 'Income', 'Age'], axis=1)
print(df_new.shape)
df_new.head(5)
Output:
(592, 15)
| | Loan_amount | Term_months | approval_status | S_F | Pur_Personal | Pur_Travel | Pur_other | M_Yes | Ed_Yes | CSc_Satisfactory | loan_income_ratio | loan_term_ratio | scaled_age | LogIncome | Binned_Dependents |
|--- |------------- |------------- |----------------- |----- |-------------- |------------ |----------- |------- |-------- |------------------ |------------------- |----------------- |------------ |----------- |------------------- |
| 0 | -0.312347 | 0.001979 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -0.345828 | -0.281617 | 0.611111 | 13.613027 | 1-2 |
| 1 | -0.346043 | 0.001979 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -0.363171 | -0.306989 | 0.185185 | 13.490196 | 0 |
| 2 | -0.339854 | 0.001979 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | -0.347181 | -0.302329 | 0.685185 | 13.410095 | 0 |
| 3 | -0.368736 | 0.001979 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -0.347174 | -0.324077 | 0.259259 | 13.122363 | 0 |
| 4 | -0.401057 | 2.248976 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -0.376759 | -0.359689 | 0.370370 | 12.921959 | 0 |
Conclusion
In this guide, you have learned about some of the most popular data preparation and feature engineering techniques. The choice of techniques depends on the data, the problem statement, and the machine learning algorithm selected for building the model.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
- Preparing Data for Modeling with scikit-learn
- Interpreting Data Using Descriptive Statistics with Python