Explore R Libraries: CARET
Jun 26, 2020 • 11 Minute Read
Introduction
R is a powerful programming language for data science that provides a wide range of libraries for machine learning. One of the most powerful and popular packages is the caret library, which follows a consistent syntax for data preparation, model building, and model evaluation, making these tasks easy for data science practitioners.
Caret stands for Classification And REgression Training and is one of the most comprehensive machine learning packages in R. It is sufficient to solve almost any classification or regression machine learning problem: it supports approximately 200 machine learning algorithms and makes it easy to perform critical tasks such as data preparation, data cleaning, feature selection, and model validation.
In this guide, you will learn how to work with the caret library in R.
Data
In this guide, you will use a fictitious dataset of loan applicants containing 600 observations and 8 variables, as described below:
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
- Income: Annual income of the applicant (in USD)
- Loan_amount: Loan amount (in USD) for which the application was submitted
- Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")
- approval_status: Whether the loan application was approved ("Yes") or not ("No")
- Age: The applicant's age in years
- Investment: Total investment in stocks and mutual funds (in USD), as declared by the applicant
- Purpose: Purpose of applying for the loan
The first step is to load the required libraries and the data.
library(caret)
library(plyr)
library(readr)
library(dplyr)
library(ROSE)
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Not _...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "...
The output shows that the dataset has four numeric and four character variables. Convert the character variables into factors using the lines of code below.
# Columns 1, 4, 5, and 8 contain the character variables
names <- c(1, 4, 5, 8)
dat[, names] <- lapply(dat[, names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Not _satisfa...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
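If you want to confirm the conversion, you can list the levels of each converted column. This optional sanity check is not part of the original workflow.
# Optional check: list the levels of each converted factor column
sapply(dat[, names], levels)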
Data Partition
The createDataPartition function is extremely useful for splitting the data into training and test datasets. This data partition is required because you will build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line performs the data partition, while the third and fourth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 8 variables) and the test set contains the remaining 30 percent (180 observations of 8 variables).
set.seed(100)
trainRowNumbers <- createDataPartition(dat$approval_status, p=0.7, list=FALSE)
train <- dat[trainRowNumbers,]
test <- dat[-trainRowNumbers,]
dim(train); dim(test)
Output:
[1] 420 8
[1] 180 8
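Because createDataPartition samples within each class of the outcome, the proportion of approved and rejected applications should be roughly the same in both partitions. The lines below are an optional sanity check, not part of the original workflow.
# Optional check: class proportions should be similar across the split
prop.table(table(train$approval_status))
prop.table(table(test$approval_status))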
Feature Scaling
The numeric features need to be scaled because the units of the variables differ significantly and may influence the modeling process. The first line of code below creates a list that contains the names of numeric variables. The second line uses the preProcess function from the caret library to complete the task. The method is to center and scale the numeric features, and the pre-processing object is fit only to the training data.
The third and fourth lines of code apply the scaling to both the train and test data partitions. The fifth line prints the summary of the preprocessed train set. The output shows that now all the numeric features have a mean value of zero.
cols = c('Income', 'Loan_amount', 'Age', 'Investment')
pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])
summary(train)
Output:
Is_graduate Income Loan_amount Credit_score
No : 90 Min. :-1.3309 Min. :-1.6568 Not _satisfactory: 97
Yes:330 1st Qu.:-0.5840 1st Qu.:-0.3821 Satisfactory :323
Median :-0.3190 Median :-0.1459
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.2341 3rd Qu.: 0.2778
Max. : 5.2695 Max. : 3.7541
approval_status Age Investment Purpose
No :133 Min. :-1.7607181 Min. :-1.09348 Education: 76
Yes:287 1st Qu.:-0.8807620 1st Qu.:-0.60103 Home :100
Median :-0.0008058 Median :-0.28779 Others : 45
Mean : 0.0000000 Mean : 0.00000 Personal :113
3rd Qu.: 0.8114614 3rd Qu.: 0.02928 Travel : 86
Max. : 1.8944843 Max. : 4.54891
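A key detail here is that pre_proc_val was fit on the training data only, so the test set is scaled using the training set's means and standard deviations and no information leaks from the test set into the model. If you are curious, you can inspect the stored values; the lines below are an optional check, not part of the original workflow.
# Centering (mean) and scaling (sd) values learned from the training set
pre_proc_val$mean
pre_proc_val$std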
Model Building
caret supports roughly 200 machine learning models through its unified train interface. You can list all of them with the code below.
available_models <- paste(names(getModelInfo()), collapse=', ')
available_models
Output:
1] "ada
The next step is to build the random forest model. Start by setting the seed in the first line of code below. The second line uses the trainControl function to specify how the training process is controlled: repeated cross-validation (method="repeatedcv") with five folds and five repeats, plus ROSE sampling (sampling="rose") to compensate for the class imbalance in approval_status.
The third line trains the random forest algorithm, specified by the argument method="rf", with accuracy as the evaluation criterion.
set.seed(100)
control1 <- trainControl(sampling = "rose", method = "repeatedcv", number = 5, repeats = 5)
rf_model <- train(approval_status ~., data=train, method="rf", metric="Accuracy", trControl=control1)
You can examine the model with the command below.
rf_model
Output:
Random Forest
420 samples
7 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 336, 336, 336, 336, 336, 336, ...
Addtional sampling using ROSE
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8799087 0.7300565
6 0.7163380 0.4289620
10 0.6675567 0.3352061
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
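In the output above, caret tuned mtry over three candidate values and kept the best one. If you want to control the candidate values yourself, you can supply an explicit grid through train's tuneGrid argument. The sketch below is illustrative: the object names rf_grid and rf_model_tuned and the mtry values are assumptions, not part of the original workflow.
# Illustrative: tune mtry over a custom grid instead of caret's default
rf_grid <- expand.grid(mtry = c(2, 4, 6))
rf_model_tuned <- train(approval_status ~ ., data = train, method = "rf",
                        metric = "Accuracy", trControl = control1,
                        tuneGrid = rf_grid)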
Model Evaluation
After building the algorithm on the training data, the next step is to evaluate its performance on the test dataset. The lines of code below generate predictions on the test set and print the confusion matrix.
predictTest = predict(rf_model, newdata = test, type = "raw")
table(test$approval_status, predictTest)
Output:
     predictTest
       No Yes
  No   56   1
  Yes  10 113
The accuracy can be calculated from the confusion matrix with the code below.
(113+56)/nrow(test)
Output:
[1] 0.9388889
The output shows that the accuracy is approximately 94 percent, which indicates that the model performed well on the test data.
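Instead of computing the accuracy by hand, you can also use caret's confusionMatrix function, which returns the same table along with accuracy, kappa, sensitivity, and specificity in a single call:
# One-call evaluation: confusion matrix plus accuracy, kappa, and more
confusionMatrix(predictTest, test$approval_status)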
Conclusion
In this guide, you learned about the caret library, which is one of the most powerful packages in R. You also learned how to scale features, create data partitions, and train and evaluate machine learning algorithms.