Exploring Data with Quantitative Techniques Using R
Exploratory data analysis is one of the most time consuming but important parts of the data science cycle. Learn techniques available in R for performing exploratory data analysis using quantitative methods.
Nov 7, 2019 • 13 Minute Read
Introduction
Exploratory data analysis is one of the most time consuming but important parts of the data science cycle. In this guide, you will learn techniques available in R for performing exploratory data analysis using quantitative methods.
Let’s start with some data.
Data
In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and 10 variables, as described below:
-
Marital_status: Whether the applicant is married ("Yes") or not ("No")
-
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
-
Income: Annual Income of the applicant (in USD)
-
Loan_amount: Loan amount (in USD) for which the application was submitted
-
Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")
-
Approval_status: Whether the loan application was approved ("Yes") or not ("No")
-
Age: The applicant's age in years
-
Gender: Whether the applicant is a male ("M") or a female ("F")
-
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
-
work_exp: Work experience in years
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)
dat <- read_csv("data_test.csv")
glimpse(dat)
Output:
Observations: 200
Variables: 10
$ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender <fct> Female, Female, Female, Female, Female, Female, Female...
$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
The output shows that the dataset has five categorical variables (labeled as 'chr'), four integer variables (labeled as 'int'), and one numeric variable, 'work_exp'. We will convert the categorical variables into 'factor' type using the line of code below.
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 200
Variables: 10
$ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender <fct> Female, Female, Female, Female, Female, Female, Female...
$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
Let’s now perform the exploratory analysis using the various quantitative techniques.
Count, Frequency, and Proportions
The basic analysis would be to compute simple calculations such as count, frequency, and proportions. The summary() function can be used to perform this task.
summary(dat)
Output:
Marital_status Is_graduate Income Loan_amount Credit_score
Divorced:60 No : 84 Min. :32000 Min. : 12200 Bad :163
No :76 Yes:116 1st Qu.:56000 1st Qu.: 61375 Good: 37
Yes :64 Median :64000 Median : 77250
Mean :62750 Mean : 349278
3rd Qu.:72000 3rd Qu.: 144750
Max. :88000 Max. :6670000
approval_status Investment gender age work_exp
No :149 Min. : 6000 Female:178 Min. :32.00 Min. : 4.000
Yes: 51 1st Qu.: 81945 Male : 22 1st Qu.:33.00 1st Qu.: 7.000
Median : 114800 Median :34.00 Median : 8.000
Mean : 169694 Mean :33.98 Mean : 7.844
3rd Qu.: 173492 3rd Qu.:34.00 3rd Qu.: 9.000
Max. :3466580 Max. :49.00 Max. :11.000
We can infer the following from the above output:
1: There are no 'NA', or missing values, in any of the variables. This confirms that the count of observations for all the variables is 200.
-
The summary statistics of the numeric variables are displayed. For example, the age of the applicants ranges from 32 to 49 years.
-
For qualitative variables, the frequency of each label in the respective variable is displayed. For example, the frequency table of the target variable 'approval_status' shows that out of 200 applicants, 149 had their loan applications rejected, while the remaining were accepted.
Proportions
When dealing with a frequency table, it is often necessary to compute proportions. In our case, we might want to calculate proportions of a categorical variable, such as 'approval_status'. The lines of code below create a frequency table and the proportions of the label for the variable 'approval_status'.
table1 = table(dat$approval_status)
prop.table(table1)
Output:
No Yes
0.745 0.255
The output shows that around 75 percent of the applicants were rejected, while the acceptance rate was around 25 percent. This can be extended for two or more variables as well. For example, if we want to analyze the percentage of approved applications across credit scores, that is also possible.
The first line of code below creates the two-way frequency table, while the second line prints the cell percentages, which means the total sum of all cells will be equal to one. As expected, the majority of applicants with poor credit scores had their applications rejected.
The prop.table function can also be used to generate percentages for rows or columns. The third and fourth lines of code below create a proportion table, which gives row and column percentages, respectively.
table2 = table(dat$approval_status, dat$Credit_score)
prop.table(table2)
prop.table(table2, 1)
prop.table(table2, 2)
Output:
Bad Good
No 0.700 0.045
Yes 0.115 0.140
Bad Good
No 0.93959732 0.06040268
Yes 0.45098039 0.54901961
Bad Good
No 0.8588957 0.2432432
Yes 0.1411043 0.7567568
Descriptive Statistics
In the previous section, we carried out preliminary data analysis. However, any quantitative analysis is incomplete without descriptive statistics. They are the foundation of data science and are defined as the measures that summarize given data. They include measures of central tendency (such as mean, median, and mode) and measures of dispersion (such as standard deviation, variance, and range).
The summary() function provides some of these statistics but not all of them. There are many other functions in R, that can be used to identify these measures. The fivenum() function is one; it returns the min, lower-hinge, median, upper-hinge, and max values. However, this function has a limitation in that it is not meaningful for factors.
fivenum(dat$Income)
Output:
1] 32000 56000 64000 72000 88000
A better alternative to both the summary and the fivenum functions is the psych package, which provides several quantitative measures, including the standard deviation, skewness, and range.
The first line of code below loads the library, while the second line uses the 'describe' function in the library to print quantitative measures of all the variables.
library(psych)
describe(dat)
Output:
vars n mean sd median trimmed mad min
Marital_status* 1 200 2.02 0.79 2 2.02 1.48 1
Is_graduate* 2 200 1.58 0.49 2 1.60 0.00 1
Income 3 200 62750.00 10638.77 64000 63062.50 11860.80 32000
Loan_amount 4 200 349278.00 741366.91 77250 170134.38 35211.75 12200
Credit_score* 5 200 1.19 0.39 1 1.11 0.00 1
approval_status* 6 200 1.25 0.44 1 1.19 0.00 1
Investment 7 200 169693.55 267134.29 114800 130429.69 65590.22 6000
gender* 8 200 1.11 0.31 1 1.01 0.00 1
age 9 200 33.98 1.76 34 33.71 1.48 32
work_exp 10 200 7.84 1.33 8 7.88 1.48 4
max range skew kurtosis se
Marital_status* 3 2 -0.03 -1.40 0.06
Is_graduate* 2 1 -0.32 -1.91 0.03
Income 88000 56000 -0.25 0.08 752.27
Loan_amount 6670000 6657800 4.53 28.55 52422.56
Credit_score* 2 1 1.61 0.60 0.03
approval_status* 2 1 1.12 -0.76 0.03
Investment 3466580 3460580 9.71 113.87 18889.25
gender* 2 1 2.47 4.14 0.02
age 49 17 4.63 31.04 0.12
work_exp 11 7 -0.25 0.08 0.09
The high skewness value for the variables 'Loan_amount' and 'Investment’ indicate that these variables are not normally distributed.
Correlation
We can use the syntax cor(X, Y) or rcorr(X, Y) to generate correlations between the two numerical columns.
cor(dat$Income, dat$Investment)
Output:
1] 0.06168653
The value of 0.06 indicates weak correlation between the two variables.
Summarizing Data
In quantitative analysis, we are often required to summarize the data based on a few variables. This can easily be done with the summarize() function in the dplyr package. The code below summarizes and prints the average values of the three numerical variables.
dat %>%
summarize(avg_income = mean(Income,na.rm=TRUE),
avg_loan = mean(Loan_amount,na.rm=TRUE),
avg_age = mean(age,na.rm=TRUE))
Output:
A tibble: 1 x 3
avg_income avg_loan avg_age
<dbl> <dbl> <dbl>
1 62750 349278 34.0
The above output shows that the average age of the applicants is 34 years, and the average income is $62,750. The average loan amount applied for is $34,9278.
Grouping Variables
Another useful way of analyzing data is by using the group_by() function, which groups the input data based on a single column or multiple columns, then manipulates each such group of data. The line of code below groups the data frame by the variable ‘approval_status’, then calculates the average income for each label.
dat %>%
group_by(approval_status) %>%
summarise(mean_inc = mean(Income))
Output:
A tibble: 2 x 2
approval_status mean_inc
<fct> <dbl>
No 62174.
Yes 64431.
The output above shows that the mean income of applicants whose loans were approved is slightly higher than that of the rejected applicants.
Inferential Statistics
We have covered several techniques of quantitative analysis. There is also a field in statistics called Inferential Statistics that can be used for quantitative exploratory analysis. This is an extremely broad topic, and it is covered at length in another guide:
Hypothesis Testing - Interpreting Data with Statistical Models
Conclusion
In this guide, you have learned techniques for performing exploratory data analysis using quantitative methods. You have learned how to use in-built 'R' functions and the analysis techniques of the powerful 'dplyr' package.
To learn more about data science in R, please refer to the following guides: