Explore R Libraries: Rpart
Jul 16, 2020 • 12 Minute Read
Introduction
Rpart is a powerful machine learning library in R that is used for building classification and regression trees. This library implements recursive partitioning and is very easy to use. In this guide, you will learn how to work with the rpart library in R.
Data
In this guide, you will use fictitious data of loan applicants containing 600 observations and eight variables, as described below:
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
- Income: Annual income of the applicant in USD
- Loan_amount: Loan amount in USD for which the application was submitted
- Credit_score: Whether the applicant's credit score is satisfactory ("Satisfactory") or not ("Not_Satisfactory")
- approval_status: Whether the loan application was approved ("Yes") or not ("No")
- Age: The applicant's age in years
- Investment: Total investment in stocks and mutual funds in USD, as declared by the applicant
- Purpose: Purpose of applying for the loan
The first step is to load the required libraries and the data.
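If any of these packages are not yet installed on your machine, you can install them first; this is a one-time step:
# Install the required packages (only needed once per machine)
install.packages(c("plyr", "readr", "dplyr", "caret", "rpart", "rpart.plot"))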
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Not _...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "...
The output shows that the dataset has four numerical variables (labelled int or dbl) and four character variables (labelled chr). You will convert the character variables into factor variables using the lines of code below.
names <- c(1, 4, 5, 8)  # column positions of the character variables
dat[,names] <- lapply(dat[,names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 8
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income <int> 3000, 3000, 3000, 3000, 8990, 13330, 13670, 13670, 173...
$ Loan_amount <dbl> 6000, 9000, 9000, 9000, 8091, 11997, 12303, 12303, 155...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Not _satisfa...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age <int> 27, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Investment <dbl> 9331, 9569, 2100, 2100, 6293, 9331, 9569, 9569, 12124,...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
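As an aside, if you prefer selecting columns by name rather than by position, an equivalent dplyr-based conversion (assuming dplyr 1.0.0 or later, which provides the across() helper) looks like this:
# Name-based alternative to the positional conversion above
dat <- dat %>%
  mutate(across(c(Is_graduate, Credit_score, approval_status, Purpose), factor))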
Data Partition
The createDataPartition function is used to split the data into training and test data. This is called the holdout-validation method for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line performs the data partition, while the third and fourth lines create the training and test set. Finally, the fifth line prints the dimension of the training and test data.
set.seed(100)
trainRowNumbers <- createDataPartition(dat$approval_status, p=0.7, list=FALSE)
train <- dat[trainRowNumbers,]
test <- dat[-trainRowNumbers,]
dim(train); dim(test)
Output:
[1] 420 8
[1] 180 8
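Note that createDataPartition samples within the levels of the outcome variable, so the class proportions of approval_status are preserved in both partitions. A quick check confirms this:
# Class distribution of the target in each partition should be similar
prop.table(table(train$approval_status))
prop.table(table(test$approval_status))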
Feature Scaling
The numeric features need to be scaled because the units of the variables differ significantly and may influence the modeling process. The first line of code below creates a vector containing the names of the numeric variables. The second line uses the preProcess function from the caret library to complete this task: it centers and scales the numeric features, and the preprocessing object is fit only on the training data.
The third and fourth lines of code below apply the scaling to both the train and test data partitions. The fifth line prints the summary of the preprocessed train set. The output shows that all the numeric features now have a mean value of zero.
cols = c('Income', 'Loan_amount', 'Age', 'Investment')
pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])
summary(train)
Output:
Is_graduate Income Loan_amount Credit_score
No : 90 Min. :-1.3309 Min. :-1.6568 Not _satisfactory: 97
Yes:330 1st Qu.:-0.5840 1st Qu.:-0.3821 Satisfactory :323
Median :-0.3190 Median :-0.1459
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.2341 3rd Qu.: 0.2778
Max. : 5.2695 Max. : 3.7541
approval_status Age Investment Purpose
No :133 Min. :-1.7607181 Min. :-1.09348 Education: 76
Yes:287 1st Qu.:-0.8807620 1st Qu.:-0.60103 Home :100
Median :-0.0008058 Median :-0.28779 Others : 45
Mean : 0.0000000 Mean : 0.00000 Personal :113
3rd Qu.: 0.8114614 3rd Qu.: 0.02928 Travel : 86
Max. : 1.8944843 Max. : 4.54891
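Because the centering and scaling parameters were estimated on the training data only, the means of the scaled test features will be close to, but not exactly, zero. You can verify this with a quick check:
# Test-set means are near zero but not exactly zero,
# since the transform was fit on the training data
round(colMeans(test[, cols]), 3)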
Model Building with rpart
The data is ready for modeling, and the next step is to build the classification decision tree. Start by setting the seed in the first line of code below. The second line uses the rpart function to fit the model, specifying the arguments that control the training process.
The important arguments of the rpart function are given below.
- formula: A formula that links the target variable to the independent features.
- data: The data to be used for modeling. In this case, you are building the model on the training data.
- method: Defines the algorithm. It can be one of anova, poisson, class, or exp. In this case, the target variable is categorical, so you will use the class method.
- minsplit: The minimum number of observations that must exist in a node in order for a split to be attempted.
- minbucket: The minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
You will build the classification decision tree with the following arguments:
set.seed(100)
tree_model = rpart(approval_status ~ Is_graduate + Income + Loan_amount + Credit_score + Age + Investment + Purpose, data = train, method="class", minsplit = 10, minbucket=3)
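As a side note, minsplit and minbucket are control parameters, and an equivalent way to pass them is through rpart.control, which also exposes other settings such as the complexity parameter cp and the maximum tree depth. A sketch, using the rpart default values of cp and maxdepth for illustration:
# Equivalent specification via rpart.control; cp = 0.01 and
# maxdepth = 30 are the rpart defaults, shown here for illustration
ctrl <- rpart.control(minsplit = 10, minbucket = 3, cp = 0.01, maxdepth = 30)
tree_model_ctrl <- rpart(approval_status ~ ., data = train, method = "class", control = ctrl)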
You can examine the model with the command below.
summary(tree_model)
Output:
Call:
rpart(formula = approval_status ~ Is_graduate + Income + Loan_amount +
Credit_score + Age + Investment + Purpose, data = train,
method = "class", minsplit = 10, minbucket = 3)
n= 420
CP nsplit rel error xerror xstd
1 0.60902256 0 1.0000000 1.0000000 0.07167876
2 0.06766917 1 0.3909774 0.3909774 0.05075155
3 0.01503759 3 0.2556391 0.2631579 0.04258808
4 0.01002506 6 0.2030075 0.2706767 0.04313607
5 0.01000000 9 0.1729323 0.2857143 0.04420254
Variable importance
Purpose Credit_score Age Investment Loan_amount Income
46 31 17 2 2 1
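The CP table above can also guide pruning. A common heuristic, sketched below with rpart's built-in printcp and prune functions, is to select the cp value with the lowest cross-validated error (the xerror column) and prune the tree back to that complexity:
# Display the complexity parameter table with cross-validated errors
printcp(tree_model)

# Prune back to the cp value with the smallest cross-validated error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(tree_model, cp = best_cp)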
An advantage of a decision tree is that you can actually visualize the model. This is done with the code below.
prp(tree_model)
The above plot shows the important features used by the algorithm for classifying observations. The variables Purpose and Credit_score emerge as the most important variables for carrying out recursive partitioning.
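The prp function accepts many display options. For example, the following call (with illustrative settings) labels all nodes and shows the class probabilities and the percentage of observations in each node:
# type = 2: label all nodes; extra = 104: show class probabilities and
# the percentage of observations per node; fallen.leaves: leaves at the bottom
prp(tree_model, type = 2, extra = 104, fallen.leaves = TRUE)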
Model Evaluation
You have built the model on the training data, and the next step is to evaluate its performance on the training and test datasets.
The code below predicts on training data, creates the confusion matrix, and finally computes the model accuracy.
PredictCART_train = predict(tree_model, newdata = train, type = "class")
table(train$approval_status, PredictCART_train)
(131+266)/(131+266+23) #94.5%
Output:
PredictCART_train
No Yes
No 131 2
Yes 21 266
[1] 0.9452381
The accuracy on the training data is very good at 94.5%. The next step is to repeat the above steps and check the model's accuracy on the test data.
PredictCART = predict(tree_model, newdata = test, type = "class")
table(test$approval_status, PredictCART)
166/180 #92.2%
Output:
PredictCART
No Yes
No 53 4
Yes 10 113
[1] 0.9222222
The output shows that the accuracy on the test data is 92%. The comparable performance on the training and test data indicates that the model is robust and generalizes well.
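Since the caret library is already loaded, you can also compute these metrics in one call with its confusionMatrix function, which reports accuracy along with statistics such as sensitivity and specificity:
# One-call evaluation on the test set, including accuracy and other metrics
confusionMatrix(PredictCART, test$approval_status)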
Conclusion
In this guide, you learned about the rpart library, one of the most powerful R libraries for building classification and regression trees. You learned how to build and evaluate a decision tree model, and how to visualize the tree with the prp function.
To learn more about data science and machine learning with R, please refer to the following guides: