
Statistical Modeling in R

This Code Lab guides learners through essential regression modeling techniques in R using lm() and glm(). By completing the lab, participants will gain hands-on experience in loading datasets, fitting and evaluating linear, logistic, and Poisson regression models, and making predictions. Learners will explore key model diagnostics, interpret statistical outputs, and visualize regression results. This foundational lab prepares participants for real-world data analysis and predictive modeling using R.


Path Info

Level: Advanced
Duration: 50m
Published: Mar 25, 2025


Table of Contents

  1. Challenge

    ### Step 0: Getting Started


    In this lab, you will fit three regression models in R:

    • Linear regression
    • Logistic regression (using the GLM family)
    • Poisson regression (using the GLM family)

    Each of these regression models is a separate Step in this lab.

    Data for each model

    Each regression model uses its own dataset. These files are available in the workspace/ folder, which will be your working directory.

    RStudio Guide

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 1 - LinearRegression.Rmd.

    You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file.

    Remember, you must run the cells with the play button at the top right of each cell before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step.

    Then, when you are ready to move on to the next step, come back and click on the file for that step (Step 2 - LogisticRegression.Rmd, and then Step 3 - PoissonRegression.Rmd) until you have completed all tasks in all steps of the lab.

  2. Challenge

    ### Step 1: Linear Regression

    Exploring Linear Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 1 - LinearRegression.Rmd.

    Linear regression is a fundamental statistical technique that helps us understand the relationship between variables. Before fitting a regression model, it is essential to explore and clean the dataset.

    You'll start by loading the dataset, checking for missing values, inspecting its structure, and visualizing key features.


    Task 1.1: Load the Dataset

    First, load the data from the insurance.csv file into an R data frame. Make sure any text columns are treated as categorical variables. Then, display the first few rows of the data frame so you can see what it looks like.

    💡 Hint Use `read.csv()` to load the dataset and set `stringsAsFactors = TRUE` to ensure categorical variables are correctly handled while reading in the data.
    🔑 Solution
    # Load the insurance data
    insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
    
    # View the first few records of the data
    head(insurance)
    

    Task 1.2: Check for Missing Values

    Count the missing values in the insurance data; if any are present, you will need to identify and deal with them.

    Identifying missing values is critical to ensure data quality.

    💡 Hint Use `is.na()` combined with `sum()` to count missing values in the dataset.
    🔑 Solution
    # Check for missing values in `insurance`
    sum(is.na(insurance))
    

    Task 1.3: Inspect the Structure of the Dataset and Get Summary Statistics

    View the structure of the data and its summary statistics.

    💡 Hint Use `str()` to inspect the structure and `summary()` to get an overview of the data.
    🔑 Solution
    # Inspect the structure
    str(insurance)
    
    # Get summary statistics
    summary(insurance)
    

    Task 1.4: Explore Data Using Visualizations

    Create a histogram of charges to visualize how charges are distributed. Then use a scatterplot to visualize the relationship between an individual's age and insurance charges.

    💡 Hint Use `hist()` for distributions and `plot()` to explore relationships between variables.
    🔑 Solution
    # Use a histogram to view distribution of insurance charges
    hist(insurance$charges, 
         main = "Distribution of Insurance Charges", 
         xlab = "Charges", 
         col = "lightblue", 
         border = "white")
    
    # Use a scatterplot to view charges vs. age
    plot(insurance$age, insurance$charges,
         xlab = "Age", ylab = "Charges",
         main = "Charges vs. Age",
         pch = 19, col = "steelblue")
    
    

    Task 1.5: Load the rsample Library to Split Data into Train and Test

    The rsample package is already part of the environment. Include it in your program.

    💡 Hint Use `library()` to load the rsample library, which you will use to split the data into training and test sets.
    🔑 Solution
    # Load the rsample library
    library(rsample)
    

    Task 1.6: Split the Data (70% Training, 30% Test)

    Split the data into training and test sets in variables called train_data and test_data. train_data should have 70% of the data, and test_data the remaining 30%. Use seed 123 for reproducibility.

    💡 Hint Use `initial_split()` to divide the dataset into training and test sets, and the `training()` and `testing()` functions to access the splits.
    🔑 Solution
    # Set a reproducible seed
    set.seed(123)
    
    # Split the data
    split <- initial_split(insurance, prop = 0.7)
    
    # Assign the training data
    train_data <- training(split)
    
    # Assign the test data
    test_data  <- testing(split)
    

    Task 1.7: Simple Regression with a Single Predictor on Training Data

    Write code to fit a simple linear regression model on the training data where charges is predicted solely by age. After fitting the model, inspect the model summary to evaluate how well age explains the variability in charges.

    Note that the regression model built using this single predictor has a very low R². age alone does not have much predictive power.

    💡 Hint Use `lm()` to fit a linear model with `charges ~ age`.
    🔑 Solution
    # Fit a simple regression model with `age` as predictor
    lm_simple <- lm(charges ~ age, data = train_data)
    
    # View summary of model results
    summary(lm_simple)
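
    The R² that summary() reports can also be extracted directly from the fitted model, which is handy if you want to store or compare it (a quick optional check):

    # Extract R-squared directly from the model summary
    summary(lm_simple)$r.squared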
    

    Task 1.8: Multiple Regression with All Predictors on Training Data

    Write code to fit a regression model using all available predictors (such as age, sex, bmi, children, smoker, and region) to predict charges. Then, review the model summary to compare the coefficients and performance metrics with the simple regression model. This comparison will show you the benefits of including additional variables.

    Note that the R² shows a substantial improvement: it is now greater than 0.7.

    💡 Hint Use `lm()` to fit a model using all available predictors (`charges ~ .`).
    🔑 Solution
    # Fit a multiple regression model with all features as predictors
    lm_multiple <- lm(charges ~ ., data = train_data)
    
    # View summary of model results
    summary(lm_multiple)
    

    Task 1.9: Make Predictions on the Test Set

    Now, use the lm_multiple model you created to predict insurance charges on the test_data. Use the predict() function for this. Then, create a new data frame called results. This data frame should have two columns: 'Actual', containing the real charges from the test_data, and 'Predicted', containing the values you just predicted.

    💡 Hint Use `predict()` to generate predictions using the multiple regression model.
    🔑 Solution
    # Make predictions on the test data
    predictions <- predict(lm_multiple, newdata = test_data)
    
    # Get the actual data and the predictions in a single data frame
    results <- data.frame(Actual = test_data$charges, Predicted = predictions)
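
    Before computing a formal metric, you may find it helpful to plot actual against predicted values; points close to the diagonal indicate good predictions. A minimal base-R sketch using the results data frame:

    # Plot actual vs. predicted charges; the diagonal marks perfect predictions
    plot(results$Actual, results$Predicted,
         xlab = "Actual Charges", ylab = "Predicted Charges",
         main = "Actual vs. Predicted Charges")
    abline(0, 1, col = "red")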
    

    Task 1.10: Compute the R² on the Test Set Using caret

    First, load the caret package. This package has a bunch of useful functions for machine learning, including one for calculating R-squared.

    Once you've loaded caret, use its R2() function to calculate the R-squared value for your predictions. You'll need to give it two things: the predictions you made (predictions) and the actual values from your test data (test_data$charges). This will tell you how well your model's predictions match the real values.

    💡 Hint Load the `caret` library and invoke the `R2(predictions, test_data$charges)` function.
    🔑 Solution
    # Load the `caret` package
    library(caret)
    
    # Compute the R2
    R2(predictions, test_data$charges)
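
    caret also exports an RMSE() function; if you want an error metric on the same scale as charges, this is a quick optional check alongside R²:

    # Root mean squared error of the predictions (same units as charges)
    RMSE(predictions, test_data$charges)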
    
  3. Challenge

    ### Step 2: Logistic Regression

    Exploring Logistic Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 2 - LogisticRegression.Rmd.

    Logistic regression is a fundamental statistical technique used for modeling binary or categorical outcomes. It helps us understand the relationship between predictor variables and a categorical response variable by estimating probabilities using the logistic function.
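
    As a quick, optional illustration (not part of the lab tasks), base R's plogis() implements the logistic function, mapping any linear-predictor value onto a probability between 0 and 1:

    # The logistic function: plogis(eta) equals 1 / (1 + exp(-eta))
    eta <- c(-2, 0, 2)     # hypothetical linear-predictor values
    plogis(eta)            # probabilities: ~0.12, 0.50, ~0.88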


    Task 2.1: Load Data for Logistic Regression

    Before starting with logistic regression, you need to load the dataset into R. The dataset in the churn.csv file contains customer information for a telecom service, along with whether each customer churned. Make sure you use stringsAsFactors so that categorical variables are read into the program as factors.

    You will fit a logistic regression model to predict whether a customer churned.

    💡 Hint Use `read.csv()` to load the dataset.
    🔑 Solution
    # Load the churn data
    churn <- read.csv("churn.csv", stringsAsFactors = TRUE)
    
    # View the first few records of the data
    head(churn)
    

    Task 2.2: View the Number of Records in Each Category

    Before training a model, it is useful to check how many customers belong to each class (churn vs. no churn). This will help you understand if the dataset is imbalanced, which can affect model performance.

    💡 Hint Use `table()` to count occurrences of each category.
    🔑 Solution
    # View the count in each category
    table(churn$Churn)
    

    Task 2.3: Visualize the Number of Records in Each Category

    A bar plot is a great way to visualize class distribution in the dataset. This helps you quickly see if one category dominates the other, which might impact model predictions.

    💡 Hint Use `barplot()` to visualize the category distribution.
    🔑 Solution
    # Store churn counts in a variable
    churn_counts <- table(churn$Churn)
    
    # Visualize using a bar plot
    barplot(churn_counts, 
         main = "Churn Distribution", 
         xlab = "Churn", 
         ylab = "Count", 
         col = "steelblue")
    

    Task 2.4: Load the rsample Library to Split Data into Train and Test

    To build a predictive model, you must split the dataset into training and testing sets. The rsample package helps you efficiently partition your data.

    💡 Hint Use `library()` to include the rsample library in your program.
    🔑 Solution
    # Load the rsample library
    library(rsample)
    

    Task 2.5: Split the Data (70% Training, 30% Test)

    A standard practice in machine learning is to allocate a portion of the data for training and another for testing. Here, 70% of the data will be used for training, while 30% will be reserved for evaluation.

    💡 Hint Use `initial_split()` to divide the dataset. Use `training()` and `testing()` to apportion the training and test data. Use 123 as a reproducible seed.
    🔑 Solution
    # Set a reproducible seed
    set.seed(123)
    
    # Split the data
    split <- initial_split(churn, prop = 0.7)
    
    # Assign the training data
    train_data <- training(split)
    
    # Assign the test data
    test_data  <- testing(split)
    

    Task 2.6: Logistic Regression with All Predictors on Training Data

    Now that the data is split, fit a logistic regression model on the training data using all available features (Churn ~ .). This will allow you to predict whether a customer will churn based on various factors.

    💡 Hint Use `glm()` with `family = binomial`.
    🔑 Solution
    # Fit a logistic regression model with all features as predictors (GLM model)
    logistic_model <- glm(Churn ~ ., family = binomial, data = train_data)
    
    # View summary of model results
    summary(logistic_model)
    

    Task 2.7: Compute the odds ratio

    The odds ratio in a logistic model quantifies how a one-unit change in a predictor variable affects the odds of the outcome occurring, holding other variables constant.

    💡 Hint Use `exp(coef())` to compute the odds ratio
    🔑 Solution
    # Compute the odds ratio	
    odds_ratio <- exp(coef(logistic_model))
    
    odds_ratio
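
    For example, a hypothetical odds ratio of 1.5 for a numeric predictor would mean that a one-unit increase in that predictor multiplies the odds of churning by 1.5, holding the other variables constant; an odds ratio below 1 indicates reduced odds.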
    

    Task 2.8: Make Predictions on the Test Set and Construct Confusion Matrix

    Once the model is trained, test it by making predictions on the test dataset. A confusion matrix will help evaluate classification performance. Use a threshold of 0.5 for the confusion matrix.

    💡 Hint Use `predict()` to make predictions and `table()` to construct a confusion matrix.
    🔑 Solution
    # Make predictions on the test data
    predict_test <- predict(logistic_model, type = "response", newdata = test_data)
    
    # Construct a confusion matrix with threshold = 0.5
    test_table <- table(test_data$Churn, predict_test > 0.5)
    
    test_table
    

    Task 2.9: Extract Confusion Matrix Components

    A confusion matrix breaks down the model’s predictions into true positives, false positives, true negatives, and false negatives. Extracting these values allows you to compute accuracy, precision, and recall for the model.

    💡 Hint Access matrix values using indexing, e.g. `test_table[1, 1]` gives the true negatives.
    🔑 Solution
    # True negatives
    true_negatives <- test_table[1, 1]
    
    # False positives
    false_positives <- test_table[1, 2]
    
    # False negatives
    false_negatives <- test_table[2, 1]
    
    # True positives
    true_positives <- test_table[2, 2]
    

    Task 2.10: Calculate and Print Performance Metrics: Accuracy, Precision, and Recall

    Accuracy, precision, and recall are key metrics for evaluating a classification model. These metrics provide insights into the model’s effectiveness in predicting churn.

    💡 Hint Use mathematical formulas to compute accuracy, precision, and recall. e.g. `accuracy <- (true_positives + true_negatives) / sum(test_table)`
    🔑 Solution
    # Calculate and print accuracy
    accuracy <- (true_positives + true_negatives) / sum(test_table)
    cat("Accuracy:", accuracy, "\n")
    
    # Calculate and print precision
    precision <- true_positives / (true_positives + false_positives)
    cat("Precision:", precision, "\n")
    
    # Calculate and print recall
    recall <- true_positives / (true_positives + false_negatives)
    cat("Recall:", recall, "\n")
    
  4. Challenge

    ### Step 3: Poisson Regression

    Exploring Poisson Regression with R

    To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.

    To get started, click on the workspace/ folder in the bottom right pane of RStudio. Then click on the file entitled Step 3 - PoissonRegression.Rmd.

    Poisson regression is a fundamental statistical technique used for modeling count data, where the response variable represents the number of occurrences of an event in a fixed interval of time or space. It assumes that the count data follows a Poisson distribution and uses the log link function to model the relationship between predictor variables and the expected count.
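
    As a quick, optional illustration (not part of the lab tasks): because of the log link, the model states that log(E[count]) equals the linear predictor, so exponentiating the linear predictor recovers the expected count:

    # Log link: log(E[count]) = eta, so E[count] = exp(eta)
    eta <- 1.2             # hypothetical linear-predictor value
    exp(eta)               # expected count implied by the model (~3.32)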


    Task 3.1: Load Dataset and View Records

    Before performing Poisson regression, load the dataset and inspect its structure to ensure it is correctly formatted.

    💡 Hint Use `read.csv()` to load the dataset and `head()` to preview the first few rows.
    🔑 Solution
    # Load the dataset
    accident_data <- read.csv("accidents_data.csv", stringsAsFactors = TRUE)
    
    # View first few rows
    head(accident_data)
    

    Task 3.2: Quick Summary of the Data

    Summarizing the dataset helps identify missing values, outliers, and key statistics for each variable.

    💡 Hint Use `summary()` to generate summary statistics of the dataset.
    🔑 Solution
    # Get summary statistics for accidents
    summary(accident_data)
    

    Task 3.3: Check Poisson Distribution Assumptions

    For Poisson regression, the mean and variance of the dependent variable should be approximately equal. Compute these values to check the assumption.

    💡 Hint Use `mean()` and `var()` to calculate these statistics.
    🔑 Solution
    # Calculate mean and variance of accident occurrences
    mean_accidents <- mean(accident_data$Accidents)
    var_accidents <- var(accident_data$Accidents)
    
    # Mean ~ Variance
    print(paste("Mean:", mean_accidents))
    print(paste("Variance:", var_accidents))
    

    Task 3.4: Compute Dispersion Ratio

    The dispersion ratio, which is the variance divided by the mean, should be close to 1 for a Poisson distribution.

    💡 Hint Use `var_accidents / mean_accidents` to compute the ratio.
    🔑 Solution
    # Compute dispersion ratio (should be close to 1)
    dispersion_ratio <- var_accidents / mean_accidents
    print(paste("Dispersion Ratio:", dispersion_ratio))
    

    Task 3.5: Visualize Average Accidents and Traffic Volume by Weekday/Weekend

    To understand patterns in the data, visualize how accidents and traffic volume vary across different days.

    💡 Hint Use `aggregate()` to compute averages and `barplot()` to visualize the results.
    🔑 Solution
    # Calculate average accidents on weekday/weekend
    avg_accidents <- aggregate(Accidents ~ Weekend, data = accident_data, FUN = mean)
    
    # Bar plot for average accidents on weekday/weekend
    barplot(avg_accidents$Accidents, names.arg = avg_accidents$Weekend, 
            main = "Average Accidents on Weekday/Weekend", col = "steelblue", 
            xlab = "Weekend", ylab = "Average Accidents")
    
    # Calculate average traffic volume on weekday/weekend
    avg_traffic <- aggregate(TrafficVolume ~ Weekend, data = accident_data, FUN = mean)
    
    # Bar plot for average traffic volume weekday/weekend
    barplot(avg_traffic$TrafficVolume, names.arg = avg_traffic$Weekend, 
            main = "Average Traffic Volume  Weekday/Weekend", col = "darkred", 
            xlab = "Weekend", ylab = "Average Traffic Volume")
    

    Task 3.6: Fit a Poisson Regression Model

    Now, fit a Poisson regression model to examine how Weekend and TrafficVolume influence accident occurrences.

    💡 Hint Use `glm()` with `family = poisson` to specify a Poisson regression model.
    🔑 Solution
    # Fit Poisson model
    poisson_model <- glm(Accidents ~ Weekend + TrafficVolume, 
                         family = poisson, data = accident_data)
    
    # Display model summary
    summary(poisson_model)
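
    Because of the log link, the fitted coefficients are on the log scale. Exponentiating them gives multiplicative effects on the expected accident count, analogous to the odds ratios in the logistic step (a quick optional check):

    # Multiplicative effect of each predictor on the expected count
    exp(coef(poisson_model))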
    

A problem solver at heart, Janani has a Masters degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
