Non-Linear Regression Trees with R
Nov 18, 2019 • 16 Minute Read
Introduction
Regression is a supervised machine learning technique that predicts a continuous outcome. There are two types of regression algorithms: linear and non-linear. While linear models are useful, they rely on the assumption of a linear relationship between the independent and dependent variables, an assumption that is difficult to meet in real business scenarios. This is where non-linear regression algorithms come into the picture, as they are able to capture the non-linearity within the data.
In this guide, you'll learn how to implement non-linear regression trees using R.
Data
Unemployment is an important socio-economic and political concern for a country, and managing it is a major task for any government. In this guide, we will build non-linear regression algorithms for predicting unemployment within an economy.
The data comes from the US economic time series available at https://research.stlouisfed.org/fred2. It contains 574 rows and 5 variables, as described below:
- psavert: personal savings rate
- pce: personal consumption expenditures, in billions of dollars
- uempmed: median duration of unemployment, in weeks
- pop: total population, in millions
- unemploy: number unemployed, in millions. This is the dependent variable.
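This dataset appears to mirror the economics data frame that ships with the ggplot2 package (the same FRED series), so if you do not have reg_data.csv at hand, you can reconstruct a comparable file with a sketch like the one below. The column selection and rescaling are assumptions based on the variable descriptions above, and the values may differ slightly in rounding.
# Sketch: rebuild a comparable CSV from ggplot2's built-in economics data.
# pop and unemploy are stored in thousands there, so rescale them to millions.
library(ggplot2)
library(dplyr)
library(readr)
reg_data <- economics %>%
  transmute(pce,
            pop = pop / 1000,            # thousands of persons -> millions
            psavert,
            uempmed,
            unemploy = unemploy / 1000)  # thousands of persons -> millions
# write_csv(reg_data, "reg_data.csv")  # uncomment to save a local copy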
Evaluation Metrics
We will evaluate the performance of the model using two metrics: the R-squared value and the Root Mean Squared Error (RMSE). A lower RMSE and a higher R-squared value are indicative of a better model.
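To make these two metrics concrete, the short sketch below computes them by hand for a small set of hypothetical actual and predicted values; the same formulas are wrapped into a reusable function later in this guide.
# Minimal sketch of the two evaluation metrics on hypothetical vectors.
actual    <- c(3.1, 4.2, 5.0, 6.3)
predicted <- c(2.9, 4.5, 5.2, 6.0)
SSE <- sum((predicted - actual)^2)        # sum of squared errors
SST <- sum((actual - mean(actual))^2)     # total sum of squares
RMSE <- sqrt(SSE / length(actual))        # root mean squared error
R_square <- 1 - SSE / SST                 # proportion of variance explained
c(RMSE = RMSE, Rsquare = R_square)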
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)
dat <- read_csv("reg_data.csv")
glimpse(dat)
Output:
Observations: 574
Variables: 5
$ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9...
$ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1...
$ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 1...
$ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4...
$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2...
The output shows that all the variables in the dataset are numerical variables (labeled as dbl). Let's look at the summary statistics of the data with the summary() function.
summary(dat)
Output:
pce pop psavert uempmed
Min. : 507.400 Min. :198.7000 Min. : 1.900000 Min. : 4.000000
1st Qu.: 1582.225 1st Qu.:224.8750 1st Qu.: 5.500000 1st Qu.: 6.000000
Median : 3953.550 Median :253.0500 Median : 7.700000 Median : 7.500000
Mean : 4843.510 Mean :257.1913 Mean : 7.936585 Mean : 8.610105
3rd Qu.: 7667.325 3rd Qu.:290.2500 3rd Qu.:10.500000 3rd Qu.: 9.100000
Max. :12161.500 Max. :320.9000 Max. :17.000000 Max. :25.200000
unemploy
Min. : 2.700000
1st Qu.: 6.300000
Median : 7.500000
Mean : 7.770383
3rd Qu.: 8.700000
Max. :15.400000
The output shows that there are no missing values in the data. It also indicates that the units and magnitudes of the variables differ, so scaling will be required.
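If you prefer to confirm the absence of missing values explicitly rather than reading it off the summary, a quick check such as the following (on the dat object loaded above) does the job.
# Count missing values per column; all zeros confirms a complete dataset.
colSums(is.na(dat))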
Data Partitioning
We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test sets, while the last two lines print the dimensions of the training and test sets. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))
train = dat[index,] # Create the training data
test = dat[-index,] # Create the test data
dim(train)
dim(test)
Output:
[1] 401 5
[1] 173 5
Scaling the Numeric Features
The numeric features need to be scaled, otherwise the modeling process may be adversely affected.
The first line of code below creates a list that contains the names of the independent numeric variables. The second line uses the preProcess function from the caret package to set up the scaling task. The pre-processing object is fit only to the training data, and the scaling is then applied to both the train and test sets in the third and fourth lines of code. The fifth line prints the summary of the scaled train dataset. The output shows that all the numeric features now have a mean value of zero, except the target variable, unemploy, which was not scaled.
cols = c('pce', 'pop', 'psavert', 'uempmed')
pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])
summary(train)
Output:
pce pop psavert uempmed unemploy
Min. :-1.2135 Min. :-1.6049 Min. :-1.89481 Min. :-1.1200 Min. : 2.700
1st Qu.:-0.8856 1st Qu.:-0.8606 1st Qu.:-0.76518 1st Qu.:-0.6198 1st Qu.: 6.500
Median :-0.2501 Median :-0.1416 Median :-0.05513 Median :-0.2447 Median : 7.500
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791
3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600
Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300
Decision Tree
Decision trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best split is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the Gini index, whereas for a regression tree the default metric is the mean squared error.
The basic workflow of decision trees is as follows:
The modeling process starts at the root node, which represents the entire data. The data is divided into two or more sub-nodes, a step referred to as splitting, and a sub-node that splits further is called a decision node. Splitting continues until the stopping criterion is reached; nodes that do not split any further are called leaf or terminal nodes. We can also remove sub-nodes through a process called pruning.
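To see what splitting means in practice, the sketch below scans candidate cut points for a single hypothetical predictor and picks the one that minimizes the summed squared error of the two resulting groups. This is a simplified illustration of what rpart does across all predictors at every node.
# Sketch: find the best split point for one predictor by minimizing
# the total squared error of the two child nodes (hypothetical data).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.0, 2.3, 2.2, 6.9, 7.1, 7.0, 7.2)
split_cost <- function(cutpoint) {
  left  <- y[x <= cutpoint]
  right <- y[x >  cutpoint]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
cutpoints <- (head(sort(x), -1) + tail(sort(x), -1)) / 2  # midpoints between consecutive x values
costs <- sapply(cutpoints, split_cost)
cutpoints[which.min(costs)]  # best cut point lies between x = 4 and x = 5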
Training the Model
We will now create a regression tree model using the rpart library. The first step is to instantiate the algorithm, which is done in the first line of code below. The argument method = "anova" specifies that a regression tree is to be built; for a classification problem, the argument would be "class". The control argument provides a list of options that govern how the decision tree algorithm is trained.
In the code below, the control = rpart.control(minsplit = 20, cp = 0.001) argument specifies that a node must contain at least 20 observations before a split is attempted, and that a split is kept only if it decreases the overall error by a factor of 0.001, referred to as the cost complexity parameter. The second line prints the summary of the model. Note that you can review all the available arguments with the ?rpart and ?rpart.control commands.
tree_model = rpart(unemploy ~ ., data=train, method = "anova", control=rpart.control(minsplit=20, cp=0.001))
summary(tree_model)
Output:
#Truncated Output
Call:
rpart(formula = unemploy ~ ., data = train, method = "anova",
control = rpart.control(minsplit = 20, cp = 0.001))
n= 401
CP nsplit rel error xerror xstd
1 0.508609424691 0 1.00000000000 1.00746991170 0.09222517909
2 0.240199126781 1 0.49139057531 0.53432047449 0.04138079775
3 0.062388054346 2 0.25119144853 0.29990084346 0.02550320375
4 0.027180308031 3 0.18880339418 0.21834245878 0.02083102075
5 0.024078055174 4 0.16162308615 0.20705841419 0.02021514091
6 0.019456798303 5 0.13754503098 0.17518743853 0.01711287778
7 0.015928262852 6 0.11808823267 0.16563982394 0.01633076066
8 0.014943330509 7 0.10215996982 0.15505479787 0.01516085859
9 0.012183802564 8 0.08721663931 0.13629795408 0.01491501765
10 0.007010774289 9 0.07503283675 0.12235179176 0.01517832910
11 0.006864114984 11 0.06101128817 0.11620014096 0.01353111243
12 0.004679462622 12 0.05414717319 0.11258856185 0.01342153785
13 0.002606235055 13 0.04946771057 0.09356044395 0.01191539603
14 0.002392463727 14 0.04686147551 0.08924901899 0.01192405697
15 0.002373022016 15 0.04446901178 0.08914954140 0.01192548795
16 0.001075173879 16 0.04209598977 0.08103175712 0.01180204685
17 0.001054136373 18 0.03994564201 0.07929979931 0.01179186081
18 0.001039626829 19 0.03889150564 0.07916127996 0.01179072614
19 0.001024169682 21 0.03681225198 0.07900442265 0.01178836902
20 0.001000000000 22 0.03578808230 0.07900442265 0.01178836902
Variable importance
pop uempmed pce psavert
33 32 28 7
The important variables, according to the regression tree model, are pop, uempmed and pce.
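The CP table above can also be used to prune the tree, and the rpart.plot package loaded earlier can visualize it. The sketch below is optional: it selects the CP value with the lowest cross-validated error (xerror) and prunes at that value, while the rest of this guide continues to evaluate the unpruned model.
# Optional sketch: visualize the fitted tree and prune it using the CP table.
rpart.plot(tree_model)
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(tree_model, cp = best_cp)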
Evaluating the Model
Let's evaluate the model performance on the train and test datasets. The first step is to create a function for calculating the evaluation metrics, R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.
# Step 1 - create the evaluation metrics function
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
SST <- sum((true - mean(true))^2)
R_square <- 1 - SSE / SST
RMSE = sqrt(SSE/nrow(df))
# Model performance metrics
data.frame(
RMSE = RMSE,
Rsquare = R_square
)
}
# Step 2 - predicting and evaluating the model on train data
predictions_train_cart = predict(tree_model, newdata = train)
eval_results(train$unemploy, predictions_train_cart, train)
# Step 3 - predicting and evaluating the model on test data
predictions_test_cart = predict(tree_model, newdata = test)
eval_results(test$unemploy, predictions_test_cart, test)
Output:
# Training set results
RMSE Rsquare
1 0.4644746 0.9642119
# Test set results
RMSE Rsquare
1 0.6106443 0.9591839
The above output shows that the RMSE and R-squared values for the decision tree model on the training data are 0.46 million and 96.4 percent, respectively. For the test data, these metrics are 0.61 million and 95.9 percent, respectively.
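A quick way to sanity-check these numbers is to plot actual against predicted values on the test set; the sketch below uses the ggplot2 package, which is already loaded.
# Sketch: actual vs. predicted unemployment on the test set for the tree model.
ggplot(data.frame(actual = test$unemploy, predicted = predictions_test_cart),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Decision tree: actual vs. predicted (test set)",
       x = "Actual unemployment (millions)",
       y = "Predicted unemployment (millions)")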
Random Forest (or Bootstrap Aggregation)
Decision trees are useful, but they tend to overfit the training data, which leads to high variance on the test data. Random forest algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a forest because they are a collection, or ensemble, of several decision trees. One major difference between a single decision tree and a random forest is how the splits happen: in a random forest, instead of trying splits on all the features, only a random sample of features is considered at each split, which decorrelates the trees and reduces the variance of the model.
Training the Model
In R, the randomForest package is used to train the random forest algorithm. The first line of code below instantiates the random forest regression model, and the second line prints the summary of the model.
rf_model = randomForest(unemploy ~ ., data=train)
summary(rf_model)
Output:
Length Class Mode
call 3 -none- call
type 1 -none- character
predicted 401 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 401 -none- numeric
importance 4 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 401 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
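The ntree and mtry components in the summary correspond to the number of trees grown and the number of predictors sampled at each split. The model above relies on the package defaults; the sketch below shows how these hyperparameters could be set explicitly, with ntree = 500 matching the documented default and mtry = 2 chosen purely for illustration.
# Sketch: the same model with the key hyperparameters spelled out.
# ntree = number of trees; mtry = predictors sampled at each split (illustrative value).
rf_explicit <- randomForest(unemploy ~ ., data = train, ntree = 500, mtry = 2)
rf_explicit$mtry  # inspect the value actually used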
We can look at the important variables using the code below. The output shows that the variables pop, pce, and uempmed are the significant predictors.
importance(rf_model)
Output:
IncNodePurity
pce 695.0615302
pop 759.6686618
psavert 266.2619092
uempmed 676.7521072
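The same information can be displayed graphically with the varImpPlot() function from the randomForest package.
# Plot variable importance (increase in node purity) for the random forest model.
varImpPlot(rf_model)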
Evaluating the Model
Let's evaluate the model performance on the train and test datasets, using the lines of code below.
# predicting and evaluating the model on train data
predictions_train_rf = predict(rf_model, data = train)
eval_results(train$unemploy, predictions_train_rf, train)
# predicting and evaluating the model on test data
predictions_test_rf = predict(rf_model, newdata = test)
eval_results(test$unemploy, predictions_test_rf, test)
Output:
# Output on Training Data
RMSE Rsquare
1 0.3469372 0.9800328
# Output on Test Data
RMSE Rsquare
1 0.511681 0.9713415
The above output shows that the RMSE and R-squared values on the training data are 0.35 million and 98 percent, respectively. For the test data, the results for these metrics are 0.51 million and 97.1 percent, respectively. The performance of the random forest model is superior to the decision tree model built earlier.
Conclusion
In this guide, you learned about tree-based non-linear regression models, namely the decision tree and the random forest. We observed that the random forest model outperforms the decision tree model. The results from the two algorithms are summarized below:
- Decision Tree Model: Test set RMSE of 0.61 million and R-squared of 96 percent.
- Random Forest Model: Test set RMSE of 0.51 million and R-squared of 97.1 percent.
To learn more about data science with R, please refer to the following guides: