Linear, Lasso, and Ridge Regression with R
Nov 12, 2019 • 20 Minute Read
Introduction
Machine learning is used by many organizations to identify and solve business problems. The two types of supervised machine learning algorithms are classification and regression. This guide will focus on regression models that predict a continuous outcome. You'll learn how to implement linear and regularized regression models using R.
The topics we'll cover include:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
Let’s start by looking at a real-life situation and data.
Data
Unemployment is a critical socio-economic and political concern for any country, and hence, managing it is a chief task for any government. In this guide, we will build regression algorithms for predicting unemployment within an economy.
The data comes from the US economic time series available at https://research.stlouisfed.org/fred2. It contains 574 rows and 5 variables, as described below:
- psavert: personal savings rate
- pce: personal consumption expenditures, in billions of dollars
- uempmed: median duration of unemployment, in weeks
- pop: total population, in millions
- unemploy: number unemployed, in millions. This is the dependent variable.
Evaluation Metrics
We will evaluate the performance of the model using two metrics: the R-squared value and the Root Mean Squared Error (RMSE). A lower RMSE and a higher R-squared value are indicative of a good model.
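As a quick reference, both metrics can be computed directly from a vector of actual values and a vector of predictions. The sketch below uses short hypothetical vectors only to show the arithmetic; the guide builds proper evaluation functions later on.
# Minimal sketch of the two evaluation metrics, using hypothetical vectors
actual <- c(3.1, 2.9, 3.4, 3.0)      # hypothetical observed values
predicted <- c(3.0, 3.1, 3.3, 2.8)   # hypothetical model predictions
rmse <- sqrt(mean((actual - predicted)^2))                               # Root Mean Squared Error
rsq <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)  # R-squared
rmse
rsq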
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
dat <- read_csv("reg_data.csv")
glimpse(dat)
Output:
Observations: 574
Variables: 5
$ pce <dbl> 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534.2, 544.9, 544.6, 5...
$ pop <dbl> 198.7, 198.9, 199.1, 199.3, 199.5, 199.7, 199.8, 199.9, 200.1, 200.2, 2...
$ psavert <dbl> 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6, 12.2, 12.0, 11.6,...
$ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4, 4.5, 4.2, 4...
$ unemploy <dbl> 2.9, 2.9, 3.0, 3.1, 3.1, 3.0, 2.9, 3.0, 2.9, 2.7, 2.7, 2.9, 2.9, 2.8, 2...
The output shows that all the variables in the dataset are numerical variables (labeled as 'dbl').
Data Partitioning
We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation approach for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line creates an index for randomly sampling observations for data partitioning. The next two lines of code create the training and test set, while the last two lines print the dimensions of the training and test set. The train set contains 70 percent of the data while the test set contains the remaining 30 percent.
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))
train = dat[index,] # Create the training data
test = dat[-index,] # Create the test data
dim(train)
dim(test)
Output:
[1] 401 5
[1] 173 5
Scaling the Numeric Features
The numeric features need to be scaled; otherwise, they may adversely influence the modeling process. The first line of code below creates a list that contains the names of the independent numeric variables. The second line uses the preProcess function from the caret package to complete the scaling task. The pre-processing object is fit only to the training data, and the scaling is then applied to both the train and test sets in the third and fourth lines of code. The fifth line prints the summary of the scaled train dataset.
The summary output below shows that all the numeric features now have a mean value of zero, except the target variable, unemploy, which was not scaled.
cols = c('pce', 'pop', 'psavert', 'uempmed')
pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])
summary(train)
Output:
pce pop psavert uempmed unemploy
Min. :-1.2135 Min. :-1.6049 Min. :-1.89481 Min. :-1.1200 Min. : 2.700
1st Qu.:-0.8856 1st Qu.:-0.8606 1st Qu.:-0.76518 1st Qu.:-0.6198 1st Qu.: 6.500
Median :-0.2501 Median :-0.1416 Median :-0.05513 Median :-0.2447 Median : 7.500
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 7.791
3rd Qu.: 0.7797 3rd Qu.: 0.9060 3rd Qu.: 0.81629 3rd Qu.: 0.1055 3rd Qu.: 8.600
Max. : 2.1244 Max. : 1.8047 Max. : 2.91417 Max. : 4.1571 Max. :15.300
Linear Regression
The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The input variables are also assumed to follow a Gaussian distribution and to be uncorrelated with each other (correlated predictors cause a problem known as multi-collinearity); a quick check for this is sketched below.
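Before fitting the model, it is worth checking how strongly the predictors are related to each other. A minimal check, assuming the scaled train data created above, is to compute the pairwise correlations:
# Pairwise correlations among the independent variables in the training set
round(cor(train[, c('pce', 'pop', 'psavert', 'uempmed')]), 2)
Correlations close to 1 or -1 between predictors are a warning sign of multi-collinearity.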
The linear regression equation can be expressed in the following form: y = a1x1 + a2x2 + a3x3 + ... + anxn + b
In the above equation:
- y is the target variable.
- x1, x2, x3, ... xn are the features.
- a1, a2, a3, ... an are the coefficients.
- b is the intercept of the model.
The coefficients a1, ..., an and the intercept b are estimated through the ordinary least squares (OLS) method, which minimizes the sum of squared residuals (actual value minus predicted value), as sketched below.
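To make this concrete, the OLS coefficients can be computed directly from the normal equations on the scaled training data. This is only a sketch of the underlying algebra; the lm() function used in the next step performs the same estimation for us, more robustly.
# OLS coefficients from the normal equations: b = (X'X)^-1 X'y
X <- cbind(Intercept = 1, as.matrix(train[, c('uempmed', 'psavert', 'pop', 'pce')]))
y <- train$unemploy
ols_coefs <- solve(t(X) %*% X, t(X) %*% y)
ols_coefs   # should match the coefficients reported by lm() below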
To fit the linear regression model, the first line of code below trains the algorithm using the lm() function. The second line prints the summary of the trained model.
lr = lm(unemploy ~ uempmed + psavert + pop + pce, data = train)
summary(lr)
Output:
Call:
lm(formula = unemploy ~ uempmed + psavert + pop + pce, data = train)
Residuals:
Min 1Q Median 3Q Max
-2.4262 -0.7253 0.0278 0.6697 3.2753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.79077 0.04712 165.352 < 2e-16 ***
uempmed 2.18021 0.08588 25.386 < 2e-16 ***
psavert 0.79126 0.13244 5.975 5.14e-09 ***
pop 5.95419 0.37405 15.918 < 2e-16 ***
pce -5.31578 0.32753 -16.230 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9435 on 396 degrees of freedom
Multiple R-squared: 0.8542, Adjusted R-squared: 0.8527
F-statistic: 579.9 on 4 and 396 DF, p-value: < 2.2e-16
The significance code ‘***’ in the above output shows that all the features are important predictors. The Adjusted R-squared value of 0.8527 is also a good result. Let's evaluate the model further.
Model Evaluation Metrics
The first step is to create a function for calculating the evaluation metrics R-squared and RMSE. The second step is to predict and evaluate the model on train data, while the third step is to predict and evaluate the model on test data.
#Step 1 - create the evaluation metrics function
eval_metrics = function(model, df, predictions, target){
  resids = df[[target]] - predictions   # residuals
  resids2 = resids**2                   # squared residuals
  N = length(predictions)
  r2 = as.character(round(summary(model)$r.squared, 2))
  adj_r2 = as.character(round(summary(model)$adj.r.squared, 2))
  print(adj_r2)                                         # Adjusted R-squared, taken from the fitted model summary
  print(as.character(round(sqrt(sum(resids2)/N), 2)))   # RMSE on the supplied data
}
# Step 2 - predicting and evaluating the model on train data
predictions = predict(lr, newdata = train)
eval_metrics(lr, train, predictions, target = 'unemploy')
# Step 3 - predicting and evaluating the model on test data
predictions = predict(lr, newdata = test)
eval_metrics(lr, test, predictions, target = 'unemploy')
Output:
1] "0.85"
[1] "0.94"
[1] "0.85"
[1] "1.1"
The above output shows that the RMSE is 0.94 million for the train data and 1.1 million for the test data. The adjusted R-squared printed by the function comes from the fitted model summary, so it is the same value, around 85 percent, in both cases; together these indicate good performance.
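Since the ggplot2 library is already loaded, a quick visual check is to plot actual against predicted values on the test set. This is an optional sketch that reuses the predictions object created in Step 3 above.
# Plot actual vs. predicted unemployment (in millions) on the test set
results <- data.frame(actual = test$unemploy, predicted = as.numeric(predictions))
ggplot(results, aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(title = "Linear regression: test set fit",
       x = "Actual unemployment (millions)",
       y = "Predicted unemployment (millions)")
Points close to the dashed 45-degree line indicate accurate predictions.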
Regularization
The linear regression algorithm works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients grow large, the model can overfit the training dataset and will not generalize well to unseen test data. To overcome this shortcoming, we use regularization, which penalizes large coefficients. A sketch of the penalized loss is given below, and the following sections of the guide discuss the individual regularization algorithms.
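The idea can be written down directly: the regularized loss is the ordinary residual sum of squares plus a penalty on the size of the coefficients. The function below is only an illustrative sketch of that formula (the intercept is normally excluded from the penalty); glmnet implements the actual optimization.
# Sketch of a regularized loss: residual sum of squares plus a coefficient penalty
penalized_loss <- function(coefs, X, y, lambda, type = c("ridge", "lasso")) {
  type <- match.arg(type)
  rss <- sum((y - X %*% coefs)^2)     # ordinary least squares part
  if (type == "ridge") {
    penalty <- sum(coefs^2)           # l2 penalty: sum of squared coefficients
  } else {
    penalty <- sum(abs(coefs))        # l1 penalty: sum of absolute coefficients
  }
  rss + lambda * penalty
}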
We will be using the glmnet package to build the regularized regression models. The glmnet function does not work with data frames, so we need to create a numeric matrix for the training features and a vector of target values.
The lines of code below create the model matrix using the dummyVars function from the caret package. The predict function is then applied to create numeric model matrices for the training and test sets.
cols_reg = c('pce', 'pop', 'psavert', 'uempmed', 'unemploy')
dummies <- dummyVars(unemploy ~ ., data = dat[,cols_reg])
train_dummies = predict(dummies, newdata = train[,cols_reg])
test_dummies = predict(dummies, newdata = test[,cols_reg])
print(dim(train_dummies)); print(dim(test_dummies))
Output:
[1] 401 4
[1] 173 4
Ridge Regression
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model. This modification is done by adding a penalty parameter that is equivalent to the square of the magnitude of the coefficients.
Loss function = OLS + alpha * summation (squared coefficient values)
Ridge regression is also referred to as l2 regularization. The lines of code below construct a ridge regression model. The first line loads the library, while the next two lines create the training data matrix of independent variables (x) and the vector of the dependent variable (y).
The same step is repeated for the test dataset in the fourth and fifth lines of code. The sixth line creates a list of lambda values for the model to try, while the seventh line builds the ridge regression model.
The arguments used in the model are:
- nlambda: determines the number of regularization parameters to be tested.
- alpha: determines the weighting to be used. In the case of ridge regression, the value of alpha is zero.
- family: determines the distribution family to be used. Since this is a regression model, we will use the Gaussian distribution.
- lambda: determines the lambda values to be tried.
The last line of code prints the model information.
library(glmnet)
x = as.matrix(train_dummies)
y_train = train$unemploy
x_test = as.matrix(test_dummies)
y_test = test$unemploy
lambdas <- 10^seq(2, -3, by = -.1)
ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
summary(ridge_reg)
Output:
Length Class Mode
a0 51 -none- numeric
beta 204 dgCMatrix S4
df 51 -none- numeric
dim 2 -none- numeric
lambda 51 -none- numeric
dev.ratio 51 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
call 7 -none- call
nobs 1 -none- numeric
One of the major differences between linear and regularized regression models is that the latter involves tuning a hyperparameter, lambda. The code above runs the glmnet() model several times for different values of lambda. We can automate this task of finding the optimal lambda value using the cv.glmnet() function. This is performed using the lines of code below.
cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
Output:
[1] 0.001
The optimal lambda value comes out to be 0.001 and will be used to build the ridge regression model. We will also create a function for calculating and printing the results, which is done with the eval_results() function in the code below. The next step is to use the predict function to generate predictions on the train and test data. Finally, we use the eval_results function to calculate and print the evaluation metrics.
# Compute R^2 from true and predicted values
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))
  # Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
}
# Prediction and evaluation on train data
predictions_train <- predict(ridge_reg, s = optimal_lambda, newx = x)
eval_results(y_train, predictions_train, train)
# Prediction and evaluation on test data
predictions_test <- predict(ridge_reg, s = optimal_lambda, newx = x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9383428 0.8539379
RMSE Rsquare
1 1.102452 0.8669624
The above output shows that the RMSE and R-squared values for the ridge regression model on the training data are 0.93 million and 85.4 percent, respectively. For the test data, the results for these metrics are 1.1 million and 86.7 percent, respectively, a slight improvement over the linear regression model.
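If you want to see what ridge regression did to the coefficients, they can be extracted at the optimal lambda with the coef function. Note that ridge shrinks coefficients toward zero but, unlike lasso, rarely sets them exactly to zero.
# Ridge coefficients at the optimal lambda
coef(ridge_reg, s = optimal_lambda)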
Lasso Regression
Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In lasso, the loss function is modified to minimize the complexity of the model by limiting the sum of the absolute values of the model coefficients (also called the l1-norm).
The loss function for lasso regression can be expressed as below:
Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)
In the above function, alpha is the penalty parameter we need to select. Using an l1-norm penalty forces some coefficient estimates to be exactly zero, effectively removing the corresponding features from the model, while the remaining coefficients take non-zero values.
The first step to build a lasso model is to find the optimal lambda value using the code below. For lasso regression, the alpha value is 1. The output is the best cross-validated lambda, which comes out to be 0.001.
lambdas <- 10^seq(2, -3, by = -.1)
# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)
# Best lambda from cross-validation
lambda_best <- lasso_reg$lambda.min
lambda_best
Output:
[1] 0.001
Once we have the optimal lambda value, we train the lasso model in the first line of code below. The second through fifth lines of code generate the predictions and print the evaluation metrics for both the training and test datasets.
lasso_model <- glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE)
predictions_train <- predict(lasso_model, s = lambda_best, newx = x)
eval_results(y_train, predictions_train, train)
predictions_test <- predict(lasso_model, s = lambda_best, newx = x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9378347 0.854096
RMSE Rsquare
1 1.099764 0.8676104
The above output shows that the RMSE and R-squared values on the training data are 0.93 million and 85.4 percent, respectively. The results on the test data are 1.1 million and 86.7 percent, respectively. Lasso regression can also be used for feature selection because the coefficients of less important features are reduced to zero.
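To see this feature-selection behavior, the lasso coefficients at the selected lambda can be inspected with the coef function; any predictor whose coefficient is exactly zero has effectively been dropped from the model. (With the very small lambda selected here, it is possible that all four predictors are retained.)
# Lasso coefficients at the selected lambda; zero entries are dropped features
coef(lasso_model, s = lambda_best)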
Elastic Net Regression
Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm. The model can be easily built using the caret package, which automatically selects the optimal values of the parameters alpha and lambda.
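For reference, an elastic net model could also be fit directly with glmnet by choosing an alpha value strictly between 0 and 1; the value 0.5 here is an arbitrary illustrative choice, not a tuned one. The caret approach shown below searches over alpha and lambda automatically.
# Direct glmnet fit mixing the ridge (alpha = 0) and lasso (alpha = 1) penalties;
# alpha = 0.5 is an arbitrary illustrative value, not a tuned one
elastic_direct <- glmnet(x, y_train, alpha = 0.5, lambda = lambdas, family = 'gaussian')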
The first line of code creates the training control object train_cont, which specifies how the repeated cross-validation will take place. The second line builds the elastic net regression model, in which a range of possible alpha and lambda values is tested and the optimum values are selected. The argument tuneLength specifies that 10 different combinations of values for alpha and lambda are to be tested.
The last line of code prints the optimum values, which come out to be approximately 0.2 for alpha and 0.0012 for lambda.
# Set training control
train_cont <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "random",
verboseIter = TRUE)
# Train the model
elastic_reg <- train(unemploy ~ .,
                     data = train,
                     method = "glmnet",
                     preProcess = c("center", "scale"),
                     tuneLength = 10,
                     trControl = train_cont)
# Best tuning parameter
elastic_reg$bestTune
Output:
alpha lambda
2 0.1996298 0.001183022
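If needed, the coefficients of the underlying glmnet fit at the selected tuning values can be pulled out of the caret object. This is a small usage sketch; note that because of the preProcess step, the coefficients are on the centered and scaled predictor scale.
# Coefficients of the final glmnet model at the selected lambda
coef(elastic_reg$finalModel, s = elastic_reg$bestTune$lambda)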
Once we have trained the model, we use it to generate the predictions and print the evaluation results for both the training and test datasets, using the lines of code below.
# Make predictions on training set
predictions_train <- predict(elastic_reg, x)
eval_results(y_train, predictions_train, train)
# Make predictions on test set
predictions_test <- predict(elastic_reg, x_test)
eval_results(y_test, predictions_test, test)
Output:
RMSE Rsquare
1 0.9495596 0.850425
RMSE Rsquare
1 1.128873 0.8605094
The above output shows that the RMSE and R-squared values for the elastic net regression model on the training data are 0.95 million and 85 percent, respectively. The results for these metrics on the test data are 1.13 million and 86 percent, respectively.
Conclusion
In this guide, you have learned about linear regression models using the powerful R language. You also learned about regularization techniques to avoid the shortcomings of the linear regression models.
The performance of the models is summarized below:
- Linear Regression Model: Test set RMSE of 1.1 million and R-squared of 85 percent.
- Ridge Regression Model: Test set RMSE of 1.1 million and R-squared of 86.7 percent.
- Lasso Regression Model: Test set RMSE of 1.1 million and R-squared of 86.7 percent.
- Elastic Net Regression Model: Test set RMSE of 1.13 million and R-squared of 86 percent.
The regularized regression models perform slightly better than the linear regression model. Overall, all the models perform well, with decent R-squared and stable RMSE values. The ideal result would be an RMSE of zero and an R-squared of 1, but that is almost impossible with real economic data.
To learn more about data science using R, please refer to the following guides:
- Interpreting Data Using Descriptive Statistics with R
- Interpreting Data Using Statistical Models with R
- Time Series Forecasting Using R
- Hypothesis Testing - Interpreting Data with Statistical Models
- Machine Learning with Text Data Using R
- Visualization of Text Data Using Word Cloud in R
- Exploring Data Visually with R