Preparing Data for Modeling with R
Nov 21, 2019 • 17 Minute Read
Introduction
Data preparation often takes up to eighty percent of a data scientist's time in a data science project. This demonstrates its importance in the machine learning life cycle. In this guide, you will learn the basics and implementation of several data preparation techniques in R:
-
Dealing with Incorrect Entries
-
Missing Value Treatment
-
Encoding Categorical Labels
-
Handling Outliers
-
Logarithmic Transformation
-
Standardization
-
Converting Column Types
Data
In this guide, we will use a fictitious dataset of loan applicants that contains 600 observations and 10 variables, as described below:
-
Marital_status: Whether the applicant is married ("1") or not ("0")
-
Dependents: Number of dependents claimed by the applicant
-
Is_graduate: Whether the applicant is a graduate ("1") or not ("0")
-
Income: Annual Income of the applicant (in hundreds of dollars)
-
Loan_amount: Loan amount (in hundreds of dollars) for which the application was submitted
-
Term_months: Tenure of the loan (in months)
-
Credit_score: Whether the applicant's credit score was good ("1") or not ("0")
-
Age: The applicant’s age in years
-
Sex: Whether the applicant is female ("F") or male ("M")
-
Approval_status: Whether the loan application was approved ("1") or not ("0")
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
dat <- read_csv("data_prep.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate <int> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income <int> 5635, 2708, 3333, 5324, 12677, 50292, 9523, 3685, 3107...
$ Loan_amount <int> 122, 126, 107, 230, 208, 169, 153, 122, 111, 107, 98, ...
$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age <int> 0, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23,...
$ Sex <chr> "M", "F", "M", "M", "M", "M", "F", "F", "F", "M", "F",...
The output shows that the dataset has nine numerical variables (labeled as int) and one character variable (labeled as chr). We will analyze the data types in the subsequent section. For now, let's look at the summary of the data.
summary(dat)
Output:
Marital_status Dependents Is_graduate Income
Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 200
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 3832
Median :1.0000 Median :0.0000 Median : 1.000 Median : 5075
Mean :0.6517 Mean :0.7308 Mean : 2.449 Mean : 7211
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 1.000 3rd Qu.: 7642
Max. :1.0000 Max. :3.0000 Max. :999.000 Max. :108000
NA's :2 NA's :1
Loan_amount Term_months Credit_score approval_status
Min. : 10.0 Min. : 36.0 Min. :0.0000 Min. :0.0000
1st Qu.:111.0 1st Qu.:384.0 1st Qu.:1.0000 1st Qu.:0.0000
Median :140.0 Median :384.0 Median :1.0000 Median :1.0000
Mean :161.6 Mean :367.1 Mean :0.7883 Mean :0.6867
3rd Qu.:180.5 3rd Qu.:384.0 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :778.0 Max. :504.0 Max. :1.0000 Max. :1.0000
Age Sex
Min. : 0.00 Length:600
1st Qu.: 36.00 Class :character
Median : 51.00 Mode :character
Mean : 51.77
3rd Qu.: 64.00
Max. :200.00
Dealing with Incorrect Entries
The above output shows that the variable Age has minimum and maximum value of 0 and 200, respectively. Also, the variable Is_graduate has a maximum value of 999, instead of the binary values of 0 and 1. These entries are incorrect and need correction. One approach would be to delete these records, but instead, we will treat these records as missing values and replace them with a measure of central tendency—mean, median, or mode.
Starting with the Age variable, the first two lines of code below convert the incorrect values 0 and 200 to missing records. We repeat the same process for the variable Is_graduate in the third line of code. The fourth and fifth lines print the information about the variables.
dat[, 9][dat[, 9] == 0] <- NA
dat[, 9][dat[, 9] == 200] <- NA
dat[, 3][dat[, 3] == 999] <- NA
summary(dat$Age)
summary(dat$Is_graduate)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
22.00 36.00 50.50 50.61 64.00 80.00 6
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 1.0000 1.0000 0.7826 1.0000 1.0000 2
Now, the variables Age and Is_graduate have 594 and 598 records, respectively. The left-out entries have been tagged as missing, which we’ll handle in the next section.
Missing Value Treatment
Missing value imputation is one of the most common data preparation steps. It's easy to count the number of missing values in each variable using the sapply() function.
sapply(dat, function(x) sum(is.na(x)))
Output:
Marital_status Dependents Is_graduate Income Loan_amount
0 2 2 0 0
Term_months Credit_score approval_status Age Sex
0 0 0 6 5
There are several techniques for handling missing values. The most widely used is replacing the values with the measure of central tendency. The first line of code below replaces the missing values of the Age variable with the median of the remaining values. The second line replaces the missing values of the Is_graduate variable with the value of '1', which indicates that the applicant's education status is graduate. The last two lines print the summary statistics of the variables.
dat$Age[is.na(dat$Age)] <- median(dat$Age, na.rm = TRUE)
dat$Is_graduate[is.na(dat$Is_graduate)] <- 1
summary(dat$Age)
table(dat$Is_graduate)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.00 36.00 50.50 50.61 64.00 80.00
0 1
130 470
The missing value treatment is complete for both the variables. The data also has a variable, Sex, with five missing values. Since this is a categorical variable, we will check the distribution of labels, which is done in the line of code below.
table(dat$Sex)
Output:
F M
111 484
The output shows that 484 out of 595 applicants are male, so we will replace the missing values with the label M. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the Sex variable, indicating that the missing values have been imputed.
dat$Sex[is.na(dat$Sex)] <- "M"
table(dat$Sex)
Output:
Length Class Mode
600 character character
We will now check if there are any more missing values with the line of code below. The output shows that we still have two missing values in the variable Dependents.
sapply(dat, function(x) sum(is.na(x)))
Output:
Marital_status Dependents Is_graduate Income Loan_amount
0 2 0 0 0
Term_months Credit_score approval_status Age Sex
0 0 0 0 0
There are two missing values left in the data set, and we'll use another approach of treating missing values by dropping the records. The first line of code below uses the complete.cases() function to drop rows with any missing values in them, while the second line checks the information about the missing values in the data set. The third line prints the resulting dimension of the data.
dat = dat[complete.cases(dat), ]
sapply(dat, function(x) sum(is.na(x)))
dim(dat)
Output:
Marital_status Dependents Is_graduate Income Loan_amount
0 0 0 0 0
Term_months Credit_score approval_status Age Sex
0 0 0 0 0
LogIncome
0
[1] 598 10
Encoding Categorical Labels
Certain machine learning algorithms like xgboost require all variables to be numeric. In this data, the variable Sex has labels (M and F), so we will have to change them to a numeric format. Since there are two labels, we can do binary encoding, which is done in the first line of code below. The output from the second line shows that we have successfully performed the encoding.
dat$Sex <- ifelse(dat$Sex == "M",1,0)
table(dat$Sex)
Output:
0 1
111 487
Handling Outliers
One of the biggest obstacles in predictive modeling can be the presence of outliers, which are extreme values that are different from the other data points. Outliers are often a problem because they mislead the training process and lead to inaccurate models.
For numeric variables, we can identify outliers numerically through the skewness value. The lines of code below print the skewness value for the Income variable.
library(e1071)
skewness(dat$Income)
Output:
1] 6.455884
The output above shows that the variable Income has a right-skewed distribution with a skewness value of 6.5. Ideally, this value should be between -1 and 1. Apart from the income variable, we also have other variables, such as Loan_amount and Age, that have differences in scale and require normalization. We'll learn techniques in the next couple of sections to deal with this problem.
Logarithmic Transformation of Numerical Variables
One of the techniques to make a skewed distribution normal is logarithmic transformation. The first line of code below creates a new variable, LogIncome, while the second line computes and prints the skewness value of this new variable.
dat$LogIncome = log(dat$Income)
skewness(dat$LogIncome)
Output:
1] 0.4860717
The above chart shows that taking the log of the Income variable makes the distribution roughly normal and reduces the skewness value. We can use the same transformation for other numerical variables, but instead, we'll learn another transformation technique called standardization.
Standardization
Several machine learning algorithms use some form of a distance matrix to learn from the data. However, when the features are using different scales, such as Age in years and Income in hundreds of dollars, the features using larger scales can unduly influence the model. As a result, we want the features to use a similar scale that can be achieved through scaling techniques.
One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below uses the mutate-at function from the dplyr library to perform the scaling. The result is that these variables are standardized with zero mean and unit variance.
dat_2 = dat %>% mutate_at(scale, .vars = vars("Income","Loan_amount", "Age"))
sd(dat_2$Income)
sd(dat_2$Loan_amount)
sd(dat_2$Age)
Output:
1] 1
[1] 1
[1] 1
Let's now look at the variables after all the steps we've implemented.
glimpse(dat_2)
Output:
Observations: 598
Variables: 11
$ Marital_status <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate <dbl> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
Converting Column Types
The output above indicates that there are many variables for which the data type is incorrect. These variables are Marital_status, Is_graduate, Credit_score, approval_status, and Sex. These are categorical variables labeled as character variables, and they need to be converted to the factor type for modeling purposes. The lines of code below perform this task.
names <- c(1,3,7,8,10)
dat_2[,names] <- lapply(dat_2[,names] , factor)
glimpse(dat_2)
Output:
Observations: 598
Variables: 11
$ Marital_status <fct> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, ...
$ Dependents <int> 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, ...
$ Is_graduate <fct> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
$ Income <dbl> -0.192200887, -0.547541541, -0.471665931, -0.229956590...
$ Loan_amount <dbl> -0.42074229, -0.37791245, -0.58135418, 0.73566336, 0.5...
$ Term_months <int> 384, 504, 504, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
$ approval_status <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, ...
$ Age <dbl> -0.003567868, -1.766094863, -1.766094863, -1.766094863...
$ Sex <fct> 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, ...
$ LogIncome <dbl> 8.636752, 7.903966, 8.111628, 8.579980, 9.447545, 10.8...
Conclusion
In this guide, you have learned about the fundamental techniques of data preparation for machine learning. You learned about dealing with missing values, identifying and treating outliers, normalizing and transforming data, and converting data types.
To learn more about data science using R, please refer to the following guides: