Implementing Marketing Analytics in R: Part 1
Feb 15, 2020 • 16 Minute Read
Introduction
Marketing and customer-related decisions are a top priority for every business. With the help of statistical modeling and analytics, it's possible to support decision makers and help them make strategic decisions based on data, not just instincts. The integration of statistical modeling with marketing strategy can also be referred to as marketing analytics.
In this series of two guides, you will learn important techniques for implementing marketing analytics in R.
This guide, Part 1, will cover:
- Customer Churn Prediction
- RFM Analysis
The next guide, Part 2, will cover:
- Clustering
- Sales Forecasting
- Other Areas
Let's start by loading the required libraries.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
Churn Prediction
Customer acquisition is more expensive than retention, so it makes business sense to retain customers, especially profitable ones. Machine learning models can estimate the probability that a customer will leave, or churn. These estimates can then be used to target valuable customers and retain those at risk. We'll build a logistic regression model to predict customer churn.
Data
In this guide, we will use a fictitious dataset of retail banking customers containing 600 observations and 10 variables, as described below:
- Marital_status: Whether the customer is married ("Yes") or not ("No").
- Is_graduate: Whether the customer is a graduate ("Yes") or not ("No").
- Income: Annual income of the customer (in USD).
- Loan_pending: Outstanding loan amount (in USD) still to be paid by the customer.
- Satisfaction_score: Satisfaction level of the customer.
- Churn: Whether the customer churned ("Yes") or not ("No").
- Age: The applicant's age in years.
- Sex: Whether the applicant is a male ("M") or a female ("F").
- Investment: Total investment in stocks and mutual funds (in USD) held by the customer.
- Purpose: Purpose of loan related to the Loan_pending variable.
Let's start by loading the data.
df_churn = read_csv("data_churn.csv")
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "No",...
$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Ye...
$ Income <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Sa...
$ Churn <chr> "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No",...
$ Age <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "...
$ Investment <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose <chr> "Travel", "Travel", "Travel", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled as int or dbl) and six character variables (labeled as chr). We will convert these character variables into factor variables using the lines of code below.
factor_cols <- c(1, 2, 5, 6, 8, 10)  # positions of the character columns
df_churn[, factor_cols] <- lapply(df_churn[, factor_cols], factor)
glimpse(df_churn)
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status <fct> No, Yes, Yes, Yes, No, Yes, No, No, Yes, Yes, No, Y...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Income <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfact...
$ Churn <fct> Yes, Yes, No, Yes, No, No, Yes, No, Yes, Yes, Yes, ...
$ Age <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex <fct> F, M, M, M, M, M, M, M, M, M, F, M, M, F, F, M, M, ...
$ Investment <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose <fct> Travel, Travel, Travel, Travel, Travel, Travel, Tra...
Data Partitioning
We will build our model on the training set and evaluate its performance on the test set. This is called the holdout-validation method for evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)
spl = sample.split(df_churn$Churn, SplitRatio = 0.70)
train = subset(df_churn, spl==TRUE)
test = subset(df_churn, spl==FALSE)
print(dim(train)); print(dim(test))
Output:
[1] 420 10
[1] 180 10
Baseline Accuracy
The next step is to estimate the baseline accuracy, which is the accuracy obtained by always predicting the majority class. The code below creates the proportion table for the target class. Since the majority class of the target variable ("Yes") has a proportion of 0.68, the baseline accuracy is 68 percent.
prop.table(table(train$Churn))
Output:
No Yes
0.3166667 0.6833333
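Rather than reading the proportion off the table, the baseline accuracy can also be pulled out programmatically; this is a small sketch using the train set created above:
# Baseline accuracy = proportion of the majority class in the training set
baseline_acc <- max(prop.table(table(train$Churn)))
round(baseline_acc, 2)  # 0.68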
Build, Predict and Evaluate the Model
The first line of code below fits a logistic regression model using the glm() function with family = "binomial". The second line prints the summary of the trained model.
model_glm = glm(Churn ~ . , family="binomial", data = train)
summary(model_glm)
Output:
Call:
glm(formula = Churn ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.24561 -0.00004 0.00004 0.00007 2.23620
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.025e+00 8.100e+03 0.000 0.9999
Marital_statusYes 4.330e-01 4.566e-01 0.948 0.3430
Is_graduateYes 9.686e-01 4.571e-01 2.119 0.0341 *
Income 8.054e-08 9.276e-06 0.009 0.9931
Loan_pending 1.486e-05 3.188e-05 0.466 0.6411
Satisfaction_scoreSatisfactory 2.284e+01 7.841e+03 0.003 0.9977
Age -6.213e-02 1.279e-02 -4.859 1.18e-06 ***
SexM 1.857e-01 5.599e-01 0.332 0.7402
Investment -1.604e-06 1.378e-05 -0.116 0.9073
PurposeHome 2.002e+00 8.100e+03 0.000 0.9998
PurposeOthers -4.128e+01 3.081e+03 -0.013 0.9893
PurposePersonal 1.388e+00 2.568e+03 0.001 0.9996
PurposeTravel -1.942e+01 2.030e+03 -0.010 0.9924
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 524.44 on 419 degrees of freedom
Residual deviance: 168.04 on 407 degrees of freedom
AIC: 194.04
Number of Fisher Scoring iterations: 19
The significance codes (such as ***) in the above output indicate which feature variables are statistically significant. Here, Age and Is_graduate are the significant predictors of churn.
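As a quick sketch building on the fitted model_glm, the significant terms can also be extracted programmatically from the coefficient table (its fourth column holds the p-values):
# Pull the coefficient table from the model summary
coefs <- summary(model_glm)$coefficients
# Keep the terms whose p-value (column 4) is below 0.05
signif_terms <- rownames(coefs)[coefs[, 4] < 0.05]
signif_terms  # per the summary above: "Is_graduateYes" and "Age"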
Let's now evaluate the model performance, which should be higher than the baseline accuracy. We start with the training data, where the first line of code generates predictions on the train set. The second line creates the confusion matrix with a threshold of 0.5, which means that for probability predictions greater than or equal to 0.5, the algorithm will predict a 'Yes' response for the Churn variable. The third line prints the accuracy of the model on the training data using the confusion matrix, and the accuracy comes out to be 90 percent.
We repeat this process on the test data, and the accuracy comes out to be 89 percent.
# Predictions on the training set
predictTrain = predict(model_glm, newdata = train, type = "response")
# Confusion matrix on training data
table(train$Churn, predictTrain >= 0.5)
(114+263)/nrow(train) #Accuracy - 90%
#Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")
# Confusion matrix on test set
table(test$Churn, predictTest >= 0.5)
161/nrow(test) #Accuracy - 89%
Output:
FALSE TRUE
No 114 19
Yes 24 263
[1] 0.897619
FALSE TRUE
No 46 11
Yes 8 115
[1] 0.8944444
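Since the caret package is already loaded, the same test-set evaluation can also be sketched with its confusionMatrix() function, which additionally reports metrics such as sensitivity and specificity:
# Convert predicted probabilities to class labels at the 0.5 threshold
pred_class <- factor(ifelse(predictTest >= 0.5, "Yes", "No"),
                     levels = levels(test$Churn))
# Confusion matrix plus accuracy, sensitivity, specificity, etc.
confusionMatrix(pred_class, test$Churn, positive = "Yes")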
RFM Analysis
RFM (Recency, Frequency, and Monetary) analysis is a technique that uses customer transaction data to determine the best customers based on how recently they have purchased, how often they purchase, and how much they spend.
Data
For the RFM analysis, we'll use a fictitious dataset of retail store customers containing 92 observations and 3 variables, as described below:
-
CustId: The unique customer number.
-
Purchase_date: The date of purchase.
-
Purchase_value: Value of the purchase (in USD).
Let's load the data and look at the structure.
df_rfm = read_csv("RFM.csv")
glimpse(df_rfm)
Output:
Observations: 92
Variables: 3
$ CustId <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date <chr> "01-Oct-19", "02-Oct-19", "03-Oct-19", "04-Oct-19", "05...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
We are interested in customer level analytics, so let's look at the unique number of customers using the code below.
length(unique(df_rfm$CustId))
Output:
[1] 25
The output shows that there are twenty-five unique customers. We'll perform the RFM analysis on this data, but before that, we have to convert the date variable into the proper format, which is done using the first line of code below.
We can observe that the Purchase_date variable covers the time period between October 1, 2019, and December 31, 2019. In order to calculate recency, we'll create a new variable, days_diff, that measures the difference between the date of purchase and reference date, which is set to be January 1, 2020. The second line of code creates this variable, while the third line prints the data structure.
df_rfm$Purchase_date = as.Date(df_rfm$Purchase_date, "%d-%b-%y")
df_rfm$days_diff = round(as.numeric(difftime(time1 = "2020-01-01",
time2 = df_rfm$Purchase_date,
units = "days")),0)
glimpse(df_rfm)
Output:
Observations: 92
Variables: 4
$ CustId <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date <date> 2019-10-01, 2019-10-02, 2019-10-03, 2019-10-04, 2019-1...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
$ days_diff <dbl> 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79,...
We are now ready to perform the RFM analysis and will use the sqldf package for the computation. The package is loaded in the first line of code below, while the second line performs the computation. The output is stored in the new data frame RFM_data, which can be displayed with the third line of code.
library(sqldf)
# Compute recency, frequency, and average purchase amount
RFM_data = sqldf("SELECT CustId,
MIN(days_diff) AS 'Recency',
COUNT(*) AS 'Frequency',
AVG(Purchase_value) AS 'Monetary'
FROM df_rfm GROUP BY 1")
glimpse(RFM_data)
Output:
Observations: 25
Variables: 4
$ CustId <chr> "Id1", "Id10", "Id11", "Id12", "Id13", "Id14", "Id15", "Id16...
$ Recency <dbl> 72, 63, 22, 21, 20, 19, 18, 17, 10, 9, 8, 1, 7, 6, 5, 4, 37,...
$ Frequency <int> 3, 3, 2, 2, 2, 3, 3, 3, 4, 4, 4, 8, 6, 5, 5, 5, 2, 2, 4, 4, ...
$ Monetary <dbl> 23.20000, 25.83333, 32.45000, 32.85000, 33.95000, 32.23333, ...
The resulting data frame has the variable CustID and the calculated variables, Recency, Frequency, and Monetary. There are 25 observations in line with the unique number of customers in the original data.
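Since dplyr is already loaded, the same aggregation can also be written without SQL; this is an equivalent sketch of the query above (RFM_dplyr is an assumed name for the result):
# Equivalent RFM computation with dplyr
RFM_dplyr <- df_rfm %>%
  group_by(CustId) %>%
  summarise(Recency   = min(days_diff),        # days since most recent purchase
            Frequency = n(),                   # number of purchases
            Monetary  = mean(Purchase_value))  # average purchase value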
Interpretation of the RFM Score
There are several ways in which the RFM analysis can be used for marketing-related decisions. One approach is to look at the percentile values of these metrics, which can be done with the line of code below.
summary(RFM_data)
Output:
CustId Recency Frequency Monetary
Length:25 Min. : 1.00 Min. :2.00 Min. :23.20
Class :character 1st Qu.: 9.00 1st Qu.:3.00 1st Qu.:28.98
Mode :character Median :19.00 Median :4.00 Median :32.12
Mean :24.04 Mean :3.68 Mean :30.82
3rd Qu.:25.00 3rd Qu.:4.00 3rd Qu.:32.85
Max. :72.00 Max. :8.00 Max. :36.07
We can draw the following inferences from the above output.
- Recency: The mean recency for customers is 24 days, while the 75th percentile value is 25 days.
- Frequency: The mean frequency for the customers is 3.68 purchases, while the 75th percentile value is 4 purchases.
- Monetary: The mean purchase value is $30.82, while the 75th percentile value is $32.85.
We observe that there are variations in these metrics across customers, and the marketing team can develop a customized strategy to target these customers based on their recency, frequency, and monetary values.
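One common way to operationalize this, shown here as an illustrative sketch rather than part of the guide's analysis, is to bin each metric into scores and combine them into a segment label. The tercile split and the score column names below are assumptions for illustration:
# Score each metric into terciles (1 = lowest, 3 = highest);
# Recency is negated so that more recent customers get the higher score
RFM_scored <- RFM_data %>%
  mutate(R_score = ntile(-Recency, 3),
         F_score = ntile(Frequency, 3),
         M_score = ntile(Monetary, 3),
         RFM_segment = paste0(R_score, F_score, M_score))
# Customers in segment "333" are the most recent, frequent, and valuable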
Conclusion
In this guide, you learned about two of the most popular use cases of marketing analytics, customer churn prediction and RFM analysis.
In Part 2, you'll learn about clustering, sales forecasting, and other techniques.
To learn more about data science with R, please refer to the other guides on this topic.