Building Classification Models in R
Classification models help predict whether a customer will churn, a bank loan will default, etc. Use R to build and train your logistic regression algorithm.
Nov 18, 2019 • 10 Minute Read
Introduction
Building classification models is one of the most important data science use cases. Classification models are models that predict a categorical label. A few examples of this include predicting whether a customer will churn or whether a bank loan will default. In this guide, you will learn how to build and evaluate a classification model in R. We will train the logistic regression algorithm, which is one of the oldest yet most powerful classification algorithms.
Data
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
- Marital_status: Whether the applicant is married ("Yes") or not ("No")
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
- Income: Annual income of the applicant (in USD)
- Loan_amount: Loan amount (in USD) for which the application was submitted
- Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not
- approval_status: Whether the loan application was approved ("Yes") or not ("No")
- Age: The applicant's age in years
- Sex: Whether the applicant is male ("M") or female ("F")
- Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
- Purpose: Purpose of applying for the loan
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye...
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisfactory", ...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M", "F", "F",...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled as int) and six character variables (labeled as chr). We will convert the character variables into factor variables using the code below.
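# Column positions of the six character variables
factor_cols <- c(1, 2, 5, 6, 8, 10)
# Convert each of those columns to a factor
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)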
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes...
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136700, 17320...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123030, 15588...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory, Satisfac...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, Yes, No, Y...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 30, ...
$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M, M, M, M...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690, 121240, ...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Travel, Educa...
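The conversion above depends on column positions, which break silently if the column order in the file changes. As a minimal sketch of an alternative, the character columns can be selected by type instead. The toy data frame below is purely illustrative (its values are not taken from data.csv), and the names toy and chr_cols are hypothetical.

```r
# Toy stand-in for the loan data; values are illustrative only
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income = c(30000L, 89900L, 133300L),
  Sex = c("F", "M", "M"),
  stringsAsFactors = FALSE
)

# Flag the character columns by type rather than by position
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only the flagged columns to factors; numeric columns are untouched
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

The same two lines should work on dat as well, since a tibble is also a list of columns.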
Data Partitioning
We will build our model on the training dataset and evaluate its performance on the test dataset. This is called the holdout-validation approach to evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test datasets. The train dataset contains 70 percent of the data (420 observations of 10 variables) while the test data contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)
spl = sample.split(dat$approval_status, SplitRatio = 0.7)
train = subset(dat, spl==TRUE)
test = subset(dat, spl==FALSE)
print(dim(train)); print(dim(test))
Output:
[1] 420 10
[1] 180 10
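The caret package loaded earlier offers a similar class-balanced split through createDataPartition(). The sketch below is an alternative to the caTools approach, not the method used in this guide, and it runs on a small made-up data frame (toy, with 30 "No" and 70 "Yes" labels) so that it can be executed without data.csv.

```r
library(caret)

set.seed(100)

# Made-up data: 30 rejected and 70 approved applications
toy <- data.frame(
  approval_status = factor(rep(c("No", "Yes"), times = c(30, 70))),
  Age = sample(21:60, 100, replace = TRUE)
)

# Sample about 70 percent of the rows within each class of the label
train_idx <- createDataPartition(toy$approval_status, p = 0.7, list = FALSE)

toy_train <- toy[train_idx[, 1], ]
toy_test <- toy[-train_idx[, 1], ]

# Class proportions should be close to 0.3 / 0.7 in both parts
prop.table(table(toy_train$approval_status))
prop.table(table(toy_test$approval_status))
```

Because the sampling happens within each class, the training and test sets keep roughly the same share of approved loans as the full data.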
Build, Predict and Evaluate the Model
To fit the logistic regression model, we call the glm() function with family = "binomial", as shown in the first line of code below. The formula approval_status ~ . uses all other variables as predictors. The second line prints the summary of the trained model.
model_glm = glm(approval_status ~ . , family="binomial", data = train)
summary(model_glm)
Output:
Call:
glm(formula = approval_status ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.19539 -0.00004 0.00004 0.00008 2.47763
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.238e-02 9.052e+03 0.000 1.0000
Marital_statusYes 4.757e-01 4.682e-01 1.016 0.3096
Is_graduateYes 5.647e-01 4.548e-01 1.242 0.2144
Income 2.244e-06 1.018e-06 2.204 0.0275 *
Loan_amount -3.081e-07 3.550e-07 -0.868 0.3854
Credit_scoreSatisfactory 2.364e+01 8.839e+03 0.003 0.9979
Age -7.985e-02 1.360e-02 -5.870 4.35e-09 ***
SexM -5.879e-01 6.482e-01 -0.907 0.3644
Investment -2.595e-06 1.476e-06 -1.758 0.0787 .
PurposeHome 2.599e+00 9.052e+03 0.000 0.9998
PurposeOthers -4.172e+01 3.039e+03 -0.014 0.9890
PurposePersonal 1.577e+00 2.503e+03 0.001 0.9995
PurposeTravel -1.986e+01 1.954e+03 -0.010 0.9919
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 524.44 on 419 degrees of freedom
Residual deviance: 166.96 on 407 degrees of freedom
AIC: 192.96
Number of Fisher Scoring iterations: 19
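The estimates in the coefficient table are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. As a small worked example, the sketch below uses the Age estimate copied from the output above; the variable names are hypothetical.

```r
# Age coefficient copied from the summary output (change in log-odds per year)
beta_age <- -7.985e-02

# Exponentiate to get the odds ratio for one additional year of age
odds_ratio_age <- exp(beta_age)

round(odds_ratio_age, 3)  # about 0.923
```

Under this model, each additional year of age multiplies the odds of approval by about 0.92, holding the other variables fixed. With the fitted model in memory, exp(coef(model_glm)) applies the same transformation to every coefficient at once.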
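The significance codes in the above output flag the coefficients that are statistically significant, not their relative importance. Age (‘***’) and Income (‘*’) are significant at the 5 percent level, while Investment (‘.’) is significant only at the 10 percent level. Let's evaluate the model further, starting by setting the baseline accuracy using the code below. Since the majority class of the target variable ("Yes") has a proportion of 0.68, the baseline accuracy is 68 percent.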
#Baseline Accuracy
prop.table(table(train$approval_status))
Output:
No Yes
0.3166667 0.6833333
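The baseline can also be pulled out of the proportion table programmatically instead of being read off by eye. The sketch below rebuilds the training labels from their class counts (133 "No" and 287 "Yes", which follow from the 420 training rows and the proportions above); the names status, class_share, and baseline_accuracy are hypothetical.

```r
# Rebuild the training labels from their class counts
status <- factor(rep(c("No", "Yes"), times = c(133, 287)))

# Share of each class in the training data
class_share <- prop.table(table(status))

# Always predicting the majority class yields this accuracy
baseline_accuracy <- max(class_share)
majority_class <- names(which.max(class_share))

baseline_accuracy
majority_class
```

With the real data, replacing status by train$approval_status gives the same numbers.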
Let's now evaluate the model performance on the training and test data, which should ideally be better than the baseline accuracy. We start by generating predictions on the training data, using the first line of code below. The second line creates the confusion matrix with a threshold of 0.5, which means that for probability predictions equal to or greater than 0.5, the algorithm will predict the Yes response for the approval_status variable. The third line prints the accuracy of the model on the training data, using the confusion matrix, and the accuracy comes out to be 91 percent.
We then repeat this process on the test data, and the accuracy comes out to be 88 percent.
# Predictions on the training set
predictTrain = predict(model_glm, newdata = train, type = "response")
# Confusion matrix on training data
table(train$approval_status, predictTrain >= 0.5)
(114+268)/nrow(train) #Accuracy - 91%
#Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")
# Confusion matrix on test set
table(test$approval_status, predictTest >= 0.5)
158/nrow(test) #Accuracy - 88%
Output:
# Confusion matrix and accuracy on training data
FALSE TRUE
No 114 19
Yes 19 268
[1] 0.9095238
# Confusion matrix and accuracy on testing data
FALSE TRUE
No 44 13
Yes 9 114
[1] 0.8777778
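The accuracy figures above are computed from hand-typed cell counts, which have to be updated whenever the data or the threshold changes. As a minimal sketch, the same numbers can be derived from the confusion matrix itself. The matrices below are typed in from the output above so that the example runs on its own, and accuracy_from_cm is a hypothetical helper, not part of any package.

```r
# Accuracy is the share of cases on the diagonal of the confusion matrix
accuracy_from_cm <- function(cm) sum(diag(cm)) / sum(cm)

labels <- list(actual = c("No", "Yes"), predicted = c("FALSE", "TRUE"))

# Confusion matrices copied from the output above
cm_train <- matrix(c(114, 19, 19, 268), nrow = 2, byrow = TRUE, dimnames = labels)
cm_test <- matrix(c(44, 13, 9, 114), nrow = 2, byrow = TRUE, dimnames = labels)

accuracy_from_cm(cm_train)  # 382 / 420
accuracy_from_cm(cm_test)   # 158 / 180

# Share of approved loans predicted as approved (test data)
sensitivity_test <- cm_test["Yes", "TRUE"] / sum(cm_test["Yes", ])

# Share of rejected loans predicted as rejected (test data)
specificity_test <- cm_test["No", "FALSE"] / sum(cm_test["No", ])
```

On the test data, the model identifies approved loans (114 of 123) more reliably than rejected ones (44 of 57), a difference that the overall accuracy alone does not show. The helper assumes a square table; if the model predicted only one class, the table would have a single column and the diagonal would not be meaningful.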
Conclusion
In this guide, you have learned how to build and evaluate a classification model in R using the logistic regression algorithm. The baseline accuracy for the data was 68 percent, while the accuracy on the training and test data was 91 percent and 88 percent, respectively. The logistic regression model beats the baseline accuracy by roughly 20 percentage points or more on both the training and test datasets, and the gap between training and test accuracy is small.