Interpreting Data Using Descriptive Statistics with R
Aug 2, 2019 • 15 Minute Read
Introduction
Descriptive statistics is the foundation for summarizing data. It is divided into measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion (also called variability) include the standard deviation, variance, and interquartile range. In this guide, you will learn how to compute these measures of descriptive statistics and use them to interpret the data.
We will begin by loading the data to be used in this guide.
Data
In this guide, we will be using the fictitious data of loan applicants containing 600 observations and 9 variables, as described below:
- Marital_status: Whether the applicant is married ("Yes") or not ("No").
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
- Income: Annual Income of the applicant (in USD).
- Loan_amount: Loan amount (in USD) for which the application was submitted.
- Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
- Age: The applicant’s age in years.
- Sex: Whether the applicant is female (F) or male (M).
- approval_status: Whether the loan application was approved ("Yes") or not ("No").
- Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.
Let us start by loading the required libraries and the data.
library(readr)
library(dplyr)
library(e1071)
dat <- read_csv("data_de.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "...
$ Income <int> 306800, 702100, 558800, 534500, 468000, 412700, 257100,...
$ Loan_amount <int> 43500, 104000, 66500, 64500, 135000, 63000, 55500, 2500...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisf...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74,...
$ Sex <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M", ...
$ Investment <int> 199420, 456365, 363220, 347425, 304200, 268255, 167115,...
Five of the variables are categorical (labelled as 'chr') while the remaining four are numerical (labelled as 'int').
Measures of Central Tendency
Measures of central tendency describe the center of the data and are often represented by the mean, median, and mode.
Mean
Mean represents the arithmetic average of the data. It is calculated by taking the sum of the values and dividing by the number of observations. The mean() function is used to calculate this in R. If the variable contains missing values, the argument na.rm=TRUE must be added to the mean function, which will now ignore the missing values while computing the mean.
The line of code below uses the sapply function to calculate the mean of the numerical variables in the data. The argument c(3,4,7,9) selects the numerical variables by their position in the data.
sapply(dat[,c(3,4,7,9)], mean)
Output:
     Income Loan_amount         Age  Investment
  705541.33   323793.67       49.45   161066.97
From the output, we can infer that the average age of the applicants is about 49.5 years, the average annual income is USD 705,541, the average loan applied for is USD 323,794, and the average investment is USD 161,067.
It is also possible to calculate the mean of a single variable in the data, as shown below.
print(mean(dat$Income))
print(mean(dat$Loan_amount))
Output:
[1] 705541.3
[1] 323793.7
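The effect of the na.rm argument described above can be seen in a minimal sketch. The vector x is a made-up toy example and is not part of the loan data.

```r
# Toy vector with one missing value (illustrative only, not from the loan data)
x <- c(10, 20, NA)

# Without na.rm, a single missing value makes the result missing
mean(x)

# With na.rm = TRUE, the NA is dropped before averaging: (10 + 20) / 2
mean(x, na.rm = TRUE)
```

The first call returns NA and the second returns 15. The same argument is accepted by median(), sd(), var(), and IQR().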
Median
The median is the middle value of a variable when its values are sorted. The line of code below uses the median() function to print the median of the numerical variables in the data.
sapply(dat[,c(3,4,7,9)], median)
Output:
     Income Loan_amount         Age  Investment
     508350       76000          51      106740
From the output, we can infer that the median age of the applicants is 51 years, the median annual income is USD 508,350, and the median loan applied for is USD 76,000.
It is also possible to calculate the median of a single variable in the data, as shown in the two lines of code below.
print(median(dat$Income))
print(median(dat$Loan_amount))
Output:
[1] 508350
[1] 76000
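In the outputs above, the mean of 'Income' and 'Loan_amount' is much larger than the median. One plausible reason is that a few very large values pull the mean upward while leaving the median almost unchanged. The sketch below illustrates this on a hypothetical toy vector (the values are invented for illustration).

```r
# Hypothetical incomes (in thousands); the last value is an extreme outlier
incomes <- c(30, 40, 50, 60, 1000)

mean(incomes)    # 236: pulled upward by the single large value
median(incomes)  # 50: the middle value of the sorted data

# Replacing the outlier with a typical value changes the mean but not the median
typical <- c(30, 40, 50, 60, 70)
mean(typical)    # 50
median(typical)  # 50
```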
Mode
Mode represents the most frequent value of a variable in the data and is the only central tendency measure that can be used with both numeric and categorical variables.
To find the mode in R by printing label-wise frequencies with summary(), we first convert the five 'chr' variables into 'factor' variables. These five variables are 'Marital_status', 'Is_graduate', 'Credit_score', 'approval_status', and 'Sex'.
The first line of code below creates a vector with the column positions of these variables in the dataset. The second line uses the lapply function to convert these columns, whose positions are stored in 'names', into factor variables. The third line prints the structure of the data.
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No,...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes...
$ Income <int> 306800, 702100, 558800, 534500, 468000, 412700...
$ Loan_amount <int> 43500, 104000, 66500, 64500, 135000, 63000, 55...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Sati...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74...
$ Sex <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M...
$ Investment <int> 199420, 456365, 363220, 347425, 304200, 268255...
The output shows that all five variables have been converted into factor variables. Now, we can print the label-wise frequency of each variable with the line of code below.
summary(dat[,c(1,2,5,6,8)])
Output:
 Marital_status Is_graduate            Credit_score approval_status Sex
 No :209        No :130     Not _satisfactory:128   No :190         F:111
 Yes:391        Yes:470     Satisfactory     :472   Yes:410         M:489
The mode for the variable 'Marital_status' is the label 'Yes', which means the majority of the applicants were married. Similarly, the mode for the variable 'Sex' is the label 'M', indicating that the majority of the applicants were male.
It is also possible to find the mode of a single variable from its frequency table, as shown in the line of code below.
table(dat$Credit_score)
Output:
Not _satisfactory      Satisfactory
              128               472
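Base R has no built-in function that returns the statistical mode (the mode() function reports an object's storage type instead). Reading the mode off a frequency table can be automated with a small helper. The function get_mode below is a hypothetical name introduced only for this sketch, and the input vectors are toy data.

```r
# Hypothetical helper: returns the most frequent label(s) of a vector.
# Ties are all returned, and values come back as character labels.
get_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # label(s) with the highest count
}

# Toy categorical data (illustrative only)
status <- c("Yes", "No", "Yes", "Yes", "No")
get_mode(status)             # "Yes"

# Works for numeric data too; here 3 and 5 are tied
get_mode(c(2, 3, 3, 5, 5))   # "3" "5"
```

Returning every tied label is a design choice made here for transparency; a helper that returns only the first maximum would hide the fact that the data is multimodal.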
Measures of Dispersion
The extent to which a distribution is stretched or squeezed is measured by dispersion, also referred to as variability, scatter, or spread. The most popular measures of dispersion are standard deviation, variance, and the interquartile range.
Standard Deviation
Standard deviation is a measure used to quantify the amount of variation of a set of data values from its mean. A low standard deviation for a variable indicates that the data points tend to be close to its mean, and vice versa. It is also used to examine if the data has a normal (or nearly normal) distribution. The line of code below prints the standard deviation of all the numerical variables in the data.
sapply(dat[,c(3,4,7,9)], sd)
Output:
      Income  Loan_amount          Age   Investment
711421.81415 724293.48078     14.72851 203058.62713
Standard deviation values should be interpreted in conjunction with the mean. For example, 'Income' is measured in USD and 'Age' in years, so comparing the dispersion of these two variables based on standard deviation alone would be misleading.
It is also possible to calculate the standard deviation of a variable, as shown in the lines of code below.
print(sd(dat$Income))
print(sd(dat$Loan_amount))
Output:
[1] 711421.8
[1] 724293.5
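One common way to read the standard deviation in conjunction with the mean is the coefficient of variation (standard deviation divided by mean), which is unit-free. This measure is not used elsewhere in this guide; the sketch below is an illustration that reuses the means and standard deviations printed in the outputs above.

```r
# Means and standard deviations copied from the outputs in this guide
means <- c(Income = 705541.33, Loan_amount = 323793.67,
           Age = 49.45, Investment = 161066.97)
sds   <- c(Income = 711421.81, Loan_amount = 724293.48,
           Age = 14.72851, Investment = 203058.63)

# Coefficient of variation: dispersion relative to the mean (no units)
cv <- sds / means
round(cv, 2)
```

On this scale, 'Age' has a value of roughly 0.30 while 'Loan_amount' exceeds 2, suggesting that the loan amounts are far more dispersed relative to their mean than the ages are.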
Variance
Variance is the square of the standard deviation and is also the covariance of a random variable with itself. The line of code below prints the variance of all the numerical variables in the dataset. The interpretation of the variance is similar to that of the standard deviation, except that it is expressed in squared units (for example, squared USD for 'Income').
sapply(dat[,c(3,4,7,9)], var)
Output:
      Income  Loan_amount          Age   Investment
5.061210e+11 5.246010e+11 2.169290e+02 4.123281e+10
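The two relationships stated above (variance as the squared standard deviation, and as the covariance of a variable with itself) can be checked numerically. The vector x below is toy data chosen for easy arithmetic.

```r
# Toy data (illustrative only); its mean is 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

var(x)     # sample variance: sum of squared deviations (32) divided by n - 1 (7)
sd(x)^2    # square of the standard deviation
cov(x, x)  # covariance of the variable with itself

# all.equal() compares with a numerical tolerance, which is safer than ==
all.equal(var(x), sd(x)^2)
all.equal(var(x), cov(x, x))
```

Both all.equal() calls return TRUE. Note that var() and sd() in R use the sample formula with n - 1 in the denominator.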
IQR
The Interquartile Range (IQR) is calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR can be calculated using the IQR() function, as shown in the line of code below.
sapply(dat[,c(3,4,7,9)], IQR)
Output:
     Income Loan_amount         Age  Investment
     381125       69250          25       89315
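The definition above can be verified by computing the two quartiles directly with quantile() and subtracting them. The ages vector is hypothetical toy data, not taken from the loan dataset.

```r
# Toy ages (illustrative only)
ages <- c(22, 30, 36, 45, 51, 58, 61, 70, 76)

# 25th and 75th percentiles, using R's default quantile method
q <- quantile(ages, probs = c(0.25, 0.75))
q

# Difference between the upper and lower quartiles
unname(q[2] - q[1])

# The built-in function gives the same number
IQR(ages)
```

For this vector the quartiles are 36 and 61, so both approaches return 25.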
Skewness
Skewness is a measure of symmetry, or the lack of it, for a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined. In a perfectly symmetrical, unimodal distribution, the mean, median, and mode all have the same value. However, most of the variables in our data are not symmetrical, which is why their mean and median values differ.
The line of code below prints the skewness value for all the numerical variables.
skew_val <- apply(dat[,c(3,4,7,9)], 2, skewness)
print(skew_val)
Output:
     Income Loan_amount         Age  Investment
 5.31789378  4.98136968 -0.05525976  8.99320361
The skewness values can be interpreted in the following manner:
- Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
- Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
- Approximately symmetric distribution: If the skewness value is between −½ and +½.
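The rules of thumb above can be applied to the skewness values reported earlier. The function classify_skew is a hypothetical helper written for this sketch, and how it treats values exactly on a boundary (±½ or ±1) is an assumption, since the rules do not specify it.

```r
# Skewness values copied from the output in this guide
skew_val <- c(Income = 5.31789378, Loan_amount = 4.98136968,
              Age = -0.05525976, Investment = 8.99320361)

# Hypothetical helper applying the three thresholds to the absolute value.
# Assumption: exactly 0.5 counts as moderate and exactly 1 counts as moderate.
classify_skew <- function(s) {
  a <- abs(s)
  ifelse(a > 1, "highly skewed",
         ifelse(a >= 0.5, "moderately skewed", "approximately symmetric"))
}

classify_skew(skew_val)
```

Under these rules, 'Income', 'Loan_amount', and 'Investment' are highly (positively) skewed, while 'Age' is approximately symmetric.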
Putting Everything Together
In the previous sections, we learned how to calculate the measures of central tendency and dispersion, individually. However, many of these measures can be calculated simultaneously, using the summary() function, which will print the summary statistics of all the variables. The line of code below performs this operation on the data.
summary(dat)
Output:
 Marital_status Is_graduate     Income         Loan_amount
 No :209        No :130     Min.   :  30000   Min.   :  10900
 Yes:391        Yes:470     1st Qu.: 384975   1st Qu.:  61000
                            Median : 508350   Median :  76000
                            Mean   : 705541   Mean   : 323794
                            3rd Qu.: 766100   3rd Qu.: 130250
                            Max.   :8444900   Max.   :7780000
            Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   6000
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  79400
                                         Median :51.00           Median : 106740
                                         Mean   :49.45           Mean   : 161067
                                         3rd Qu.:61.00           3rd Qu.: 168715
                                         Max.   :76.00           Max.   :3466580
The output above shows the key summary statistics for each numerical variable (minimum, first quartile, median, mean, third quartile, and maximum) and the label-wise counts for each factor variable. The IQR can be calculated as the difference between the third and first quartile values.
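As a worked check of the IQR calculation, the quartiles printed by summary() can be subtracted by hand. The sketch below copies the first and third quartile values from the output above into two named vectors.

```r
# First and third quartiles copied from the summary() output in this guide
q1 <- c(Income = 384975, Loan_amount = 61000, Age = 36, Investment = 79400)
q3 <- c(Income = 766100, Loan_amount = 130250, Age = 61, Investment = 168715)

# IQR = third quartile - first quartile
q3 - q1
```

The result (381125, 69250, 25, and 89315) matches the values returned by the IQR() function in the earlier section.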
Summary Statistics using Multiple Variables
Sometimes we may want to understand a statistic using a combination of two or more categories. For example, we may want to compare the mean of the numerical variables across two or more categorical variables.
The first line of code below uses the aggregate function to create a table of mean values for all the numerical variables, across the two categorical variables, 'Sex' and 'approval_status'. The second line of code prints the output.
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
Output:
  Group.1 Group.2   Income Loan_amount      Age Investment
1       F      No 544824.3    228027.0 44.16216   132583.8
2       M      No 734543.1    353334.0 50.32026   158825.1
3       F     Yes 646274.3    256114.9 51.55405   157135.4
4       M     Yes 723086.0    335793.5 49.17262   166090.2
An interesting pattern in the output above is that the female applicants whose loan application was approved had noticeably higher average income, age, and investment values than the female applicants whose application was not approved. This observation can be useful for feature engineering.
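The grouping columns in the output above are labelled 'Group.1' and 'Group.2'. One alternative is the formula interface of aggregate(), which keeps the original variable names. The sketch below uses a small made-up data frame named toy (an assumption for illustration, not the loan data) with the same column names.

```r
# Toy data frame (illustrative only) reusing column names from the guide
toy <- data.frame(
  Sex             = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("No", "Yes", "No", "Yes", "Yes", "No"),
  Income          = c(100, 200, 300, 400, 300, 500),
  Age             = c(30, 40, 50, 60, 50, 40)
)

# Formula interface: numeric columns on the left, grouping variables on the right
agg_toy <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                     data = toy, FUN = mean)
agg_toy
```

The resulting table has columns named 'Sex' and 'approval_status', which makes the output easier to read without renaming. Note that the formula interface drops rows with missing values by default, which can differ from the behavior of the by = list(...) form.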
Conclusion
In this guide, you have learned about the fundamentals of the most widely used descriptive statistics and their calculations with R, an extremely powerful statistical programming language. You have learned about the following topics in this guide:
- Mean
- Median
- Mode
- Standard Deviation
- Variance
- Interquartile Range
- Skewness