
Statistical Modeling in R

This Code Lab guides learners through essential regression modeling techniques in R using lm() and glm(). By completing the lab, participants will gain hands-on experience in loading datasets, fitting and evaluating linear, logistic, and Poisson regression models, and making predictions. Learners will explore key model diagnostics, interpret statistical outputs, and visualize regression results. This foundational lab prepares participants for real-world data analysis and predictive modeling using R.


Path Info

Level: Advanced
Duration: 50m
Published: Mar 25, 2025


Table of Contents

  1. Challenge

    ### Step 0: Getting Started


    In this lab, you will fit three regression models in R:

    • Linear regression
    • Logistic regression (using the GLM family)
    • Poisson regression (using the GLM family)

    Each of these regression models is a separate Step in this lab.

    Data for each model

    Each regression model uses its own dataset. These files are available in the workspace/ folder, which will be your working directory.

    RStudio Guide

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 1 - LinearRegression.Rmd.

    You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file.

    Remember, you must run the cells with the play button at the top right of each cell before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step.

    Then, when you are ready to move on to the next step, come back and click on the file for that step (Step 2 - LogisticRegression.Rmd, and then Step 3 - PoissonRegression.Rmd) until you have completed all tasks in all steps of the lab.

  2. Challenge

    ### Step 1: Linear Regression

    Exploring Linear Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 1 - LinearRegression.Rmd.

    Linear regression is a fundamental statistical technique that helps us understand the relationship between variables. Before fitting a regression model, it is essential to explore and clean the dataset.

    You'll start by loading the dataset, checking for missing values, inspecting its structure, and visualizing key features.


    Task 1.1: Load the Dataset

    First, load the data from the insurance.csv file into an R data frame. Make sure any text columns are treated as categorical variables. Then, display the first few rows of the data frame so you can see what it looks like.

    💡 Hint Use `read.csv()` to load the dataset and set `stringsAsFactors = TRUE` to ensure categorical variables are correctly handled while reading in the data.
    🔑 Solution
    # Load the insurance data
    insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
    
    # View the first few records of the data
    head(insurance)
    

    Task 1.2: Check for Missing Values

    Count the missing values in the insurance data; if any are present, you will need to identify and deal with them.

    Identifying missing values is critical to ensure data quality.

    💡 Hint Use `is.na()` combined with `sum()` to count missing values in the dataset.
    🔑 Solution
    # Check for missing values in `insurance`
    sum(is.na(insurance))
    

    Task 1.3: Inspect the Structure of the Dataset and Get Summary Statistics

    View the structure of the data and its summary statistics.

    💡 Hint Use `str()` to inspect the structure and `summary()` to get an overview of the data.
    🔑 Solution
    # Inspect the structure
    str(insurance)
    
    # Get summary statistics
    summary(insurance)
    

    Task 1.4: Explore Data Using Visualizations

    Create a histogram of charges to visualize how charges are distributed. Then use a scatterplot to visualize the relationship between an individual's age and insurance charges.

    💡 Hint Use `hist()` for distributions and `plot()` to explore relationships between variables.
    🔑 Solution
    # Use a histogram to view distribution of insurance charges
    hist(insurance$charges, 
         main = "Distribution of Insurance Charges", 
         xlab = "Charges", 
         col = "lightblue", 
         border = "white")
    
    # Use a scatterplot to view charges vs. age
    plot(insurance$age, insurance$charges,
         xlab = "Age", ylab = "Charges",
         main = "Charges vs. Age",
         pch = 19, col = "steelblue")
    
    

    Task 1.5: Load the rsample Library to Split Data into Train and Test

    The rsample package is already part of the environment. Include it in your program.

    💡 Hint Use `library()` to load the rsample library, which you will use to split the data into training and test sets.
    🔑 Solution
    # Load the rsample library
    library(rsample)
    

    Task 1.6: Split the Data (70% Training, 30% Test)

    Split the data into training and test sets in variables called train_data and test_data. train_data should have 70% of the data, and test_data the remaining 30%. Use seed 123 for reproducibility.

    💡 Hint Use `initial_split()` to divide the dataset into training and test sets, and the `training()` and `testing()` functions to access the splits.
    🔑 Solution
    # Set a reproducible seed
    set.seed(123)
    
    # Split the data
    split <- initial_split(insurance, prop = 0.7)
    
    # Assign the training data
    train_data <- training(split)
    
    # Assign the test data
    test_data  <- testing(split)
    

    Task 1.7: Simple Regression with a Single Predictor on Training Data

    Write code to fit a simple linear regression model on the training data where charges is predicted solely by age. After fitting the model, inspect the model summary to evaluate how well age explains the variability in charges.

    Note that the regression model built using this single predictor has a very low R². age alone does not have much predictive power.

    💡 Hint Use `lm()` to fit a linear model with `charges ~ age`.
    🔑 Solution
    # Fit a simple regression model with `age` as predictor
    lm_simple <- lm(charges ~ age, data = train_data)
    
    # View summary of model results
    summary(lm_simple)
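
    The R² that summary() reports can also be extracted directly from the fitted model, which is handy if you want to store or compare it (a quick optional check):

    # Extract R-squared directly from the model summary
    summary(lm_simple)$r.squared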
    

    Task 1.8: Multiple Regression with All Predictors on Training Data

    Write code to fit a regression model using all available predictors (such as age, sex, bmi, children, smoker, and region) to predict charges. Then, review the model summary to compare the coefficients and performance metrics with the simple regression model. This comparison will show you the benefits of including additional variables.

    Note that the R² shows a substantial improvement: it is now greater than 0.7.

    💡 Hint Use `lm()` to fit a model using all available predictors (`charges ~ .`).
    🔑 Solution
    # Fit a multiple regression model with all features as predictors
    lm_multiple <- lm(charges ~ ., data = train_data)
    
    # View summary of model results
    summary(lm_multiple)
    

    Task 1.9: Make Predictions on the Test Set

    Now, use the lm_multiple model you created to predict insurance charges on the test_data. Use the predict() function for this. Then, create a new data frame called results. This data frame should have two columns: 'Actual', containing the real charges from the test_data, and 'Predicted', containing the values you just predicted.

    💡 Hint Use `predict()` to generate predictions using the multiple regression model.
    🔑 Solution
    # Make predictions on the test data
    predictions <- predict(lm_multiple, newdata = test_data)
    
    # Get the actual data and the predictions in a single data frame
    results <- data.frame(Actual = test_data$charges, Predicted = predictions)
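
    Before computing a formal metric, you may find it helpful to plot actual against predicted values; points close to the diagonal indicate good predictions. A minimal base-R sketch using the results data frame:

    # Plot actual vs. predicted charges; the diagonal marks perfect predictions
    plot(results$Actual, results$Predicted,
         xlab = "Actual Charges", ylab = "Predicted Charges",
         main = "Actual vs. Predicted Charges")
    abline(0, 1, col = "red")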
    

    Task 1.10: Compute the R² on the Test Set Using caret

    First, load the caret package. This package has a bunch of useful functions for machine learning, including one for calculating R-squared.

    Once you've loaded caret, use its R2() function to calculate the R-squared value for your predictions. You'll need to give it two things: the predictions you made (predictions) and the actual values from your test data (test_data$charges). This will tell you how well your model's predictions match the real values.

    💡 Hint Load the `caret` library and invoke the `R2(predictions, test_data$charges)` function.
    🔑 Solution
    # Load the `caret` package
    library(caret)
    
    # Compute the R2
    R2(predictions, test_data$charges)
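
    caret also exports an RMSE() function; if you want an error metric on the same scale as charges, this is a quick optional check alongside R²:

    # Root mean squared error of the predictions (same units as charges)
    RMSE(predictions, test_data$charges)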
    
  3. Challenge

    ### Step 2: Logistic Regression

    Exploring Logistic Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 2 - LogisticRegression.Rmd.

    Logistic regression is a fundamental statistical technique used for modeling binary or categorical outcomes. It helps us understand the relationship between predictor variables and a categorical response variable by estimating probabilities using the logistic function.
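
    As a quick, optional illustration (not part of the lab tasks), base R's plogis() implements the logistic function, mapping any linear-predictor value onto a probability between 0 and 1:

    # The logistic function: plogis(eta) equals 1 / (1 + exp(-eta))
    eta <- c(-2, 0, 2)     # hypothetical linear-predictor values
    plogis(eta)            # probabilities: ~0.12, 0.50, ~0.88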


    Task 2.1: Load Data for Logistic Regression

    Before starting with logistic regression, you need to load the dataset into R. The dataset in the churn.csv file contains customer information for a telecom service, along with whether each customer churned. Make sure you use stringsAsFactors so that categorical variables are read into the program as factors.

    You will fit a logistic regression model to predict whether a customer churned.

    💡 Hint Use `read.csv()` to load the dataset.
    🔑 Solution
    # Load the churn data
    churn <- read.csv("churn.csv", stringsAsFactors = TRUE)
    
    # View the first few records of the data
    head(churn)
    

    Task 2.2: View the Number of Records in Each Category

    Before training a model, it is useful to check how many customers belong to each class (churn vs. no churn). This will help you understand if the dataset is imbalanced, which can affect model performance.

    💡 Hint Use `table()` to count occurrences of each category.
    🔑 Solution
    # View the count in each category
    table(churn$Churn)
    

    Task 2.3: Visualize the Number of Records in Each Category

    A bar plot is a great way to visualize class distribution in the dataset. This helps you quickly see if one category dominates the other, which might impact model predictions.

    💡 Hint Use `barplot()` to visualize the category distribution.
    🔑 Solution
    # Store churn counts in a variable
    churn_counts <- table(churn$Churn)
    
    # Visualize using a bar plot
    barplot(churn_counts, 
         main = "Churn Distribution", 
         xlab = "Churn", 
         ylab = "Count", 
         col = "steelblue")
    

    Task 2.4: Load the rsample Library to Split Data into Train and Test

    To build a predictive model, you must split the dataset into training and testing sets. The rsample package helps you efficiently partition your data.

    💡 Hint Use `library()` to include the rsample library in your program.
    🔑 Solution
    # Load the rsample library
    library(rsample)
    

    Task 2.5: Split the Data (70% Training, 30% Test)

    A standard practice in machine learning is to allocate a portion of the data for training and another for testing. Here, 70% of the data will be used for training, while 30% will be reserved for evaluation.

    💡 Hint Use `initial_split()` to divide the dataset. Use `training()` and `testing()` to apportion the training and test data. Use 123 as a reproducible seed.
    🔑 Solution
    # Set a reproducible seed
    set.seed(123)
    
    # Split the data
    split <- initial_split(churn, prop = 0.7)
    
    # Assign the training data
    train_data <- training(split)
    
    # Assign the test data
    test_data  <- testing(split)
    

    Task 2.6: Logistic Regression with All Predictors on Training Data

    Now that the data is split, fit a logistic regression model on the training data using all available features (Churn ~ .). This will allow you to predict whether a customer will churn based on various factors.

    💡 Hint Use `glm()` with `family = binomial`.
    🔑 Solution
    # Fit a logistic regression model with all features as predictors (GLM model)
    logistic_model <- glm(Churn ~ ., family = binomial, data = train_data)
    
    # View summary of model results
    summary(logistic_model)
    

    Task 2.7: Compute the odds ratio

    The odds ratio in a logistic model quantifies how a one-unit change in a predictor variable affects the odds of the outcome occurring, holding other variables constant.

    💡 Hint Use `exp(coef())` to compute the odds ratio
    🔑 Solution
    # Compute the odds ratio	
    odds_ratio <- exp(coef(logistic_model))
    
    odds_ratio
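
    For example, a hypothetical odds ratio of 1.5 for a numeric predictor would mean that a one-unit increase in that predictor multiplies the odds of churning by 1.5, holding the other variables constant; an odds ratio below 1 indicates reduced odds.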
    

    Task 2.8: Make Predictions on the Test Set and Construct Confusion Matrix

    Once the model is trained, test it by making predictions on the test dataset. A confusion matrix will help evaluate classification performance. Use a threshold of 0.5 for the confusion matrix.

    💡 Hint Use `predict()` to make predictions and `table()` to construct a confusion matrix.
    🔑 Solution
    # Make predictions on the test data
    predict_test <- predict(logistic_model, type = "response", newdata = test_data)
    
    # Construct a confusion matrix with threshold = 0.5
    test_table <- table(test_data$Churn, predict_test > 0.5)
    
    test_table
    

    Task 2.9: Extract Confusion Matrix Components

    A confusion matrix breaks down the model’s predictions into true positives, false positives, true negatives, and false negatives. Extracting these values allows you to compute accuracy, precision, and recall for the model.

    💡 Hint Access matrix values using indexing, e.g. `test_table[1, 1]` gives the true negatives.
    🔑 Solution
    # True negatives
    true_negatives <- test_table[1, 1]
    
    # False positives
    false_positives <- test_table[1, 2]
    
    # False negatives
    false_negatives <- test_table[2, 1]
    
    # True positives
    true_positives <- test_table[2, 2]
    

    Task 2.10: Calculate and Print Performance Metrics: Accuracy, Precision, and Recall

    Accuracy, precision, and recall are key metrics for evaluating a classification model. These metrics provide insights into the model’s effectiveness in predicting churn.

    💡 Hint Use mathematical formulas to compute accuracy, precision, and recall. e.g. `accuracy <- (true_positives + true_negatives) / sum(test_table)`
    🔑 Solution
    # Calculate and print accuracy
    accuracy <- (true_positives + true_negatives) / sum(test_table)
    cat("Accuracy:", accuracy, "\n")
    
    # Calculate and print precision
    precision <- true_positives / (true_positives + false_positives)
    cat("Precision:", precision, "\n")
    
    # Calculate and print recall
    recall <- true_positives / (true_positives + false_negatives)
    cat("Recall:", recall, "\n")
    
  4. Challenge

    ### Step 3: Poisson Regression

    Exploring Poisson Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 3 - PoissonRegression.Rmd.

    Poisson regression is a fundamental statistical technique used for modeling count data, where the response variable represents the number of occurrences of an event in a fixed interval of time or space. It assumes that the count data follows a Poisson distribution and uses the log link function to model the relationship between predictor variables and the expected count.
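
    As a quick, optional illustration (not part of the lab tasks): because of the log link, the model states that log(E[count]) equals the linear predictor, so exponentiating the linear predictor recovers the expected count:

    # Log link: log(E[count]) = eta, so E[count] = exp(eta)
    eta <- 1.2             # hypothetical linear-predictor value
    exp(eta)               # expected count implied by the model (~3.32)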


    Task 3.1: Load Dataset and View Records

    Before performing Poisson regression, load the dataset and inspect its structure to ensure it is correctly formatted.

    💡 Hint Use `read.csv()` to load the dataset and `head()` to preview the first few rows.
    🔑 Solution
    # Load the dataset
    accident_data <- read.csv("accidents_data.csv", stringsAsFactors = TRUE)
    
    # View first few rows
    head(accident_data)
    

    Task 3.2: Quick Summary of the Data

    Summarizing the dataset helps identify missing values, outliers, and key statistics for each variable.

    💡 Hint Use `summary()` to generate summary statistics of the dataset.
    🔑 Solution
    # Get summary statistics for accidents
    summary(accident_data)
    

    Task 3.3: Check Poisson Distribution Assumptions

    For Poisson regression, the mean and variance of the dependent variable should be approximately equal. Compute these values to check the assumption.

    💡 Hint Use `mean()` and `var()` to calculate these statistics.
    🔑 Solution
    # Calculate mean and variance of accident occurrences
    mean_accidents <- mean(accident_data$Accidents)
    var_accidents <- var(accident_data$Accidents)
    
    # Mean ~ Variance
    print(paste("Mean:", mean_accidents))
    print(paste("Variance:", var_accidents))
    

    Task 3.4: Compute Dispersion Ratio

    The dispersion ratio, which is the variance divided by the mean, should be close to 1 for a Poisson distribution.

    💡 Hint Use `var_accidents / mean_accidents` to compute the ratio.
    🔑 Solution
    # Compute dispersion ratio (should be close to 1)
    dispersion_ratio <- var_accidents / mean_accidents
    print(paste("Dispersion Ratio:", dispersion_ratio))
    

    Task 3.5: Visualize Average Accidents and Traffic Volume by Weekday/Weekend

    To understand patterns in the data, visualize how accidents and traffic volume vary across different days.

    💡 Hint Use `aggregate()` to compute averages and `barplot()` to visualize the results.
    🔑 Solution
    # Calculate average accidents on weekday/weekend
    avg_accidents <- aggregate(Accidents ~ Weekend, data = accident_data, FUN = mean)
    
    # Bar plot for average accidents on weekday/weekend
    barplot(avg_accidents$Accidents, names.arg = avg_accidents$Weekend, 
            main = "Average Accidents on Weekday/Weekend", col = "steelblue", 
            xlab = "Weekend", ylab = "Average Accidents")
    
    # Calculate average traffic volume on weekday/weekend
    avg_traffic <- aggregate(TrafficVolume ~ Weekend, data = accident_data, FUN = mean)
    
    # Bar plot for average traffic volume weekday/weekend
    barplot(avg_traffic$TrafficVolume, names.arg = avg_traffic$Weekend, 
            main = "Average Traffic Volume  Weekday/Weekend", col = "darkred", 
            xlab = "Weekend", ylab = "Average Traffic Volume")
    

    Task 3.6: Fit a Poisson Regression Model

    Now, fit a Poisson regression model to examine how Weekend and TrafficVolume influence accident occurrences.

    💡 Hint Use `glm()` with `family = poisson` to specify a Poisson regression model.
    🔑 Solution
    # Fit Poisson model
    poisson_model <- glm(Accidents ~ Weekend + TrafficVolume, 
                         family = poisson, data = accident_data)
    
    # Display model summary
    summary(poisson_model)
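
    Because of the log link, the fitted coefficients are on the log scale. Exponentiating them gives multiplicative effects on the expected accident count, analogous to the odds ratios in the logistic step (a quick optional check):

    # Multiplicative effect of each predictor on the expected count
    exp(coef(poisson_model))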
    

A problem solver at heart, Janani has a Masters degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
