Summarizing Data and Deducing Probabilities
Apr 13, 2020 • 13 Minute Read
Introduction
Summarizing data is undoubtedly one of the most common data science and analytics tasks. For predictive modeling, you also need to understand the concept of probability, which forms the basis of many machine learning algorithms like logistic regression. In this guide, you will learn the techniques of summarizing data and deducing probabilities in R.
Data
In this guide, you'll use a fictitious dataset of loan applications containing 600 observations and nine variables, as described below:
- Marital_status: Whether the applicant is married ("Yes") or not ("No").
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
- Income: Annual income of the applicant (in USD).
- Loan_amount: Loan amount (in USD) for which the application was submitted.
- Credit_score: Whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory").
- Age: The applicant's age in years.
- Sex: Whether the applicant is female (F) or male (M).
- approval_status: Whether the loan application was approved ("Yes") or not ("No").
- Investment: Investments in stocks and mutual funds (in USD) declared by the applicant.
The lines of code below load the required libraries and the data.
library(tidyverse)  # attaches readr, dplyr, and ggplot2, among others
library(readr)
library(dplyr)
library(e1071)
library(ggplot2)
library(reshape2)
library(knitr)
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", ...
$ Income <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M",...
$ Investment <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The above output shows that five variables are categorical (labeled as chr) while the remaining four are numerical (labeled as int or dbl). Convert the character variables to factor variables with the code below.
dat$Marital_status = as.factor(dat$Marital_status)
dat$Is_graduate = as.factor(dat$Is_graduate)
dat$Credit_score = as.factor(dat$Credit_score)
dat$approval_status = as.factor(dat$approval_status)
dat$Sex = as.factor(dat$Sex)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Ye...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, N...
$ Income <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M, M, M, ...
$ Investment <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The changes have been made and you are ready to summarize and analyze the data.
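When a dataset has many character columns, converting them one by one becomes tedious. The sketch below shows one possible way to convert every character column in a single step using base R. The data frame `toy` and its values are illustrative stand-ins, not the loan data.

```r
# Toy data frame standing in for `dat`; the values are illustrative only.
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Sex = c("M", "F", "M"),
  Income = c(30680L, 70210L, 55880L),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each flagged column to a factor.
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

# Character columns are now factors; the integer column is untouched.
str(toy)
```

The same two lines should work on `dat` as long as every character column is meant to be categorical.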
Summarizing Univariate Data
As a data scientist, you'll often need to summarize individual variables in data. A common way to do this is through descriptive statistics, which include measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion include the standard deviation, variance, and interquartile range (IQR). Some of these measures are briefly explained below.
- Mean: the arithmetic average of the data
- Median: the middle value of a variable, which divides the sorted data into two equal halves
- Mode: the most frequent value of a variable and the only central tendency measure that can be used with both numeric and categorical variables
- Standard deviation: quantifies the amount of variation from the mean of a set of data values
The lines of code below calculate the mean, median, and standard deviation of the Income and Loan_amount variables.
# Income
print(mean(dat$Income))
print(median(dat$Income))
print(sd(dat$Income))
# Loan_amount
print(mean(dat$Loan_amount))
print(median(dat$Loan_amount))
print(sd(dat$Loan_amount))
Output:
[1] 70554.13
[1] 50835
[1] 71142.18
[1] 32379.37
[1] 7600
[1] 72429.35
The above code calculates the mean, median, and standard deviation. To find the mode, create the frequency table of the categorical variable, as shown in the code below.
table(dat$Credit_score)
Output:
Not _satisfactory Satisfactory
128 472
The output shows that the mode of the variable Credit_score is Satisfactory, the most frequent label, which occurs 472 times.
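Base R has no built-in function that returns the statistical mode, so it can be convenient to wrap the frequency table in a small helper. The function `get_mode()` below is a hypothetical helper written for this guide, and the `scores` vector holds illustrative values rather than the loan data.

```r
# Hypothetical helper: returns the most frequent value of a vector as a string.
# If several values tie, only the first one in table order is returned.
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Illustrative values only, not the loan data.
scores <- c("Satisfactory", "Not_satisfactory", "Satisfactory", "Satisfactory")
get_mode(scores)
```

Because the result comes from the names of the table, it is always a character string, even when the input is numeric.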
Summarizing Multiple Variables
In the previous section, you used descriptive statistics to summarize individual variables. Often, however, you will want to summarize multiple variables together. For example, you might want to compute the mean of all the numerical variables in one line of code. This can be done with the sapply() function as shown below.
sapply(dat[,c(3,4,7,9)], mean)
Output:
Income Loan_amount Age Investment
70554.13 32379.37 49.45 16106.70
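Hard-coded column positions such as c(3,4,7,9) break silently if the column order changes. As a sketch of an alternative, the numerical columns can be selected by type instead. The data frame `toy` below is an illustrative stand-in with made-up values.

```r
# Toy data frame with two numeric columns and one factor; values are illustrative.
toy <- data.frame(
  Income = c(30680, 70210, 55880),
  Age = c(76, 75, 75),
  Sex = factor(c("M", "M", "F"))
)

# Flag the numeric columns, then take the mean of each flagged column.
num_cols <- vapply(toy, is.numeric, logical(1))
means <- sapply(toy[num_cols], mean)
means
```

The factor column is skipped automatically, so no positions need to be counted by hand.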
The other method is to use the summary() function, which prints the summary statistics of all the variables. The line of code below performs this operation.
summary(dat)
Output:
 Marital_status Is_graduate     Income         Loan_amount
 No :209        No :130     Min.   :  3000   Min.   :  1090
 Yes:391        Yes:470     1st Qu.: 38498   1st Qu.:  6100
                            Median : 50835   Median :  7600
                            Mean   : 70554   Mean   : 32379
                            3rd Qu.: 76610   3rd Qu.: 13025
                            Max.   :844490   Max.   :778000
           Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128  No :190         Min.   :22.00   F:111   Min.   :   600
 Satisfactory     :472  Yes:410         1st Qu.:36.00   M:489   1st Qu.:  7940
                                        Median :51.00           Median : 10674
                                        Mean   :49.45           Mean   : 16107
                                        3rd Qu.:61.00           3rd Qu.: 16872
                                        Max.   :76.00           Max.   :346658
The above output prints the important summary statistics of all the variables, including the mean, median, minimum, and maximum values for the numerical variables and the label counts for the factor variables. You can calculate the IQR by subtracting the first quartile from the third quartile.
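As a short worked example, the IQR of Income can be computed from the quartiles printed by summary(). The result is approximate because summary() may round the values it displays. The second half of the sketch uses an illustrative vector `x` to show that the built-in IQR() function gives the same result as subtracting the quartiles by hand.

```r
# First and third quartiles of Income, copied from the summary() output above.
q1 <- 38498
q3 <- 76610
iqr_income <- q3 - q1
iqr_income

# Illustrative vector, not the loan data.
x <- c(2, 4, 6, 8, 10, 12, 14, 16)

# IQR() and the manual quartile difference agree.
IQR(x)
unname(quantile(x, 0.75) - quantile(x, 0.25))
```

On the real data, IQR(dat$Income) would avoid copying the quartiles manually.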
Sometimes you'll want to understand a statistic using a combination of two or more categories. For example, you might want the mean of each numerical variable broken down by the applicant's sex and the approval status. This can be done using the code below. The first line of code uses the aggregate() function to create a table of the means of all the numerical variables across the two categorical variables, Sex and approval_status. The second line of code prints the output.
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
Output:
  Group.1 Group.2  Income Loan_amount   Age Investment
1       F      No 54482.4     22802.7 44.16   13258.38
2       M      No 73454.3     35333.4 50.32   15882.51
3       F     Yes 64627.4     25611.4 51.55   15713.54
4       M     Yes 72308.6     33579.3 49.17   16609.02
One interesting observation from the above table is that female applicants whose loan applications were approved had higher mean incomes, ages, and investment values than female applicants whose applications were not approved. Patterns like this can suggest useful features when building machine learning models.
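The grouping columns produced by the by = list(...) form are named Group.1 and Group.2, which can be hard to read. One possible alternative is the formula interface of aggregate(), which keeps the original variable names. The sketch below uses an illustrative data frame `toy` with made-up values.

```r
# Toy data standing in for `dat`; the values are illustrative only.
toy <- data.frame(
  Sex = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("No", "Yes", "No", "Yes", "Yes", "Yes"),
  Income = c(40000, 60000, 50000, 70000, 80000, 90000),
  Age = c(30, 45, 50, 40, 55, 60)
)

# Mean of Income and Age for every combination of Sex and approval_status.
agg_toy <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                     data = toy, FUN = mean)
agg_toy
```

The left side of the formula lists the variables to summarize, and the right side lists the grouping variables.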
Probability
In simple terms, probability can be defined as the extent to which an event is likely to occur and is measured by the ratio of the favorable cases to the total number of cases possible. For example, the probability of randomly picking a red ball from a box containing three red and seven blue balls is 0.3. This is arrived at by dividing the number of favorable cases, which is three in this example, by the total number of possible cases, which is ten.
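The ball example can be reproduced in a few lines of R. The sketch below computes the exact ratio and then, as an optional sanity check, estimates the same probability by simulating repeated draws. The seed and the number of draws are arbitrary choices made for illustration.

```r
# Exact probability: favorable cases divided by total cases.
red <- 3
blue <- 7
p_red <- red / (red + blue)
p_red

# Optional check by simulation: draw one ball at a time, with replacement.
set.seed(42)
box <- c(rep("red", red), rep("blue", blue))
draws <- sample(box, size = 10000, replace = TRUE)

# The share of red draws should be close to the exact value.
p_red_sim <- mean(draws == "red")
p_red_sim
```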
You can apply this simple logic to calculate the probability of loan approval in the data. The table() function in the first line of code below gives the frequency distribution of approved (denoted by the label "Yes") and rejected (denoted by the label "No") applications. The second line of code uses the logic explained above to calculate the probability of a loan application getting approved.
table(dat$approval_status)
410/(410+190)
Output:
[1] 0.6833333
You can also perform the above step by using the code below.
prop.table(table(dat$approval_status))
Output:
No Yes
0.3166667 0.6833333
Conditional Probability
An important probability application in data science is to compute conditional probability. A conditional probability is the probability of an event A occurring when a secondary event B has already occurred. Mathematically, it is represented as P(A | B), and is read as "the probability of A given B."
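The definition can be written as P(A | B) = P(A and B) / P(B), provided P(B) is greater than zero. The sketch below applies this formula to two hypothetical events, `A` and `B`, recorded as logical vectors over eight made-up observations.

```r
# Hypothetical events recorded for eight observations.
A <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
B <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)

# mean() of a logical vector gives the proportion of TRUE values.
p_B <- mean(B)
p_A_and_B <- mean(A & B)

# Conditional probability from the definition.
p_A_given_B <- p_A_and_B / p_B
p_A_given_B

# Equivalent shortcut: keep only the observations where B occurred.
mean(A[B])
```

Both expressions give the same value because dividing by P(B) is the same as restricting attention to the cases where B occurred.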
In this dataset, you may want to estimate the probability that a randomly selected application was approved given that the applicant was at least 40 years old. This is an example of conditional probability and can be calculated using the code below.
dat %>%
summarize(prob = sum(Age >= 40 & approval_status == "Yes", na.rm = TRUE)/sum(Age >= 40, na.rm = TRUE))
Output:
prob
<dbl>
1 0.684
You can see that the probability comes out to be 0.68. This means that if you randomly select an application from those submitted by applicants who were at least 40 years old, the probability that it was approved is about 68 percent.
You can repeat this for two categorical variables as well. For example, you may want to estimate the probability that a randomly selected application was approved given that the applicant's credit score was not satisfactory. The lines of code below will compute this probability.
dat %>%
summarize(prob = sum(Credit_score == "Not _satisfactory" & approval_status == "Yes", na.rm = TRUE)/sum(Credit_score == "Not _satisfactory", na.rm = TRUE))
Output:
prob
<dbl>
1 0.296875
The output above shows that the conditional probability of approval, given that the credit score is not satisfactory, is 29.7 percent. This insight can be useful to inform a risk management policy.
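When both variables are categorical, all the conditional probabilities can be obtained at once from a two-way table. The sketch below passes margin = 1 to prop.table() so that each row sums to one, which gives the distribution of approval status conditional on credit score. The data frame `toy` holds illustrative values, not the loan data.

```r
# Toy data standing in for `dat`; the values are illustrative only.
toy <- data.frame(
  Credit_score = c("Satisfactory", "Satisfactory", "Satisfactory",
                   "Not_satisfactory", "Not_satisfactory", "Satisfactory"),
  approval_status = c("Yes", "Yes", "No", "No", "Yes", "Yes")
)

# Two-way frequency table: rows are credit scores, columns are approval status.
counts <- table(toy$Credit_score, toy$approval_status)

# margin = 1 divides each cell by its row total, giving P(approval | credit score).
cond <- prop.table(counts, margin = 1)
cond
```

Using margin = 2 instead would condition on approval status and give the distribution of credit scores within each approval group.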
Conclusion
In this guide, you learned about the fundamentals of summarizing data for univariate and multivariate analysis. You also learned how to compute probabilities and conditional probabilities that'll help in understanding the data and generating meaningful insights.