Manipulating Dataframes in R
Nov 7, 2019 • 12 Minute Read
Introduction
It is said that eighty percent of the time in a data science project is spent on data preparation and data cleaning. In this guide, you will learn about the tricks and techniques of manipulating dataframes in R using the popular package dplyr.
The 'dplyr' library offers several powerful functions to manipulate the dataframe, which is a two-dimensional data structure containing rows and columns.
In particular, you will learn the following data manipulation techniques:
- Filter
- Select
- Mutate
- Arrange
- Summarize
- Group_by
- Count 8: Rename
Let’s begin by loading the data.
Data
In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10 variables, as described below:
-
Marital_status: Whether the applicant is married ("Yes") or not ("No")
-
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
-
Income: Annual Income of the applicant (in USD)
-
Loan_amount: Loan amount (in USD) for which the application was submitted
-
Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")
-
Approval_status: Whether the loan application was approved ("Yes") or not ("No")
-
Age: The applicant's age in years
-
Sex: Whether the applicant is a male ("M") or a female ("F")
-
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
-
Purpose: Purpose of applying for the loan
Let's load the required libraries and the data.
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)
dat <- read_csv("data_r2.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "...
The output shows that the data has six categorical variables (labeled as 'chr'), and four numerical variables (labeled as 'int'). We will convert the categorical variables into the 'factor' type, using the lines of code below.
names <- c(1,2,5,6,8,10)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 600
Variables: 10
$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
The data is now ready for carrying out the various data manipulation steps.
Filter
The filter command selects rows based on the specified condition. We will start by filtering the data for applicants whose loan is approved. The first line of code below prints the table of the variable 'approval_status'. The output shows that there are 410 applicants whose loan was approved.
The second line creates a new dataframe, 'approved_loan', while the third line prints the dimension, which is 410 rows and 10 variables.
table(dat$approval_status)
approved_loan = dat %>% filter(approval_status == "Yes")
print(dim(approved_loan))
Output:
1] 410 10
One of the things to notice in the code above is the use of the pipe operator, written as ***%>%***. This pipe operator enables us to chain multiple 'dplyr' commands together, such that it takes output from one command and feeds it as input to the next command.
The 'filter' command can also be used to include multiple conditions. The code below filters the data using different conditions on two variables, ‘approval_status’ and ‘credit_score’.
aproved_satis <- dat %>% filter(approval_status == "Yes", Credit_score == "Satisfactory")
dim(aproved_satis)
Output:
1] 372 10
The above output shows that the resultant data now has 372 records instead of the original 600 because of the conditions we used.
The filter command can also be used with numerical variables, as shown in the lines of code below. The output confirms that the operation is completed.
income_age <- dat %>% filter(Income > 600000, Age >= 30)
dim(income_age)
summary(income_age$Income)
summary(income_age$Age)
Output:
1] 205 10
Min. 1st Qu. Median Mean 3rd Qu. Max.
606300 711100 843300 1100888 1274700 3173700
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.00 43.00 55.00 52.81 62.00 75.00
Select
The select() command selects columns in a data. The lines of code below provides the entire data as input to the ‘select’ function using the '%>%' operator. Inside the select function, the columns to be selected are specified.
The output shows that the resultant data has 600 observations and 3 variables.
dat_3 = dat %>% select(Marital_status, Age, Sex)
glimpse(dat_3)
Output:
Observations: 600
Variables: 3
$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No,...
$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33,...
$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M...
Mutate
The mutate() function helps in feature engineering by allowing us to create new variables. For example, we can use two variables, 'Investment' and 'Income', to create a new variable, 'Inv_inc_ratio'. The mutate verb adds the new column to the dataframe, as shown in the syntax below.
dat_ratio = dat %>% mutate(Inv_inc_ratio = Investment / Income * 100)
dim(dat_ratio)
summary(dat_ratio$Inv_inc_ratio)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 20 70 55.69 70 70
Arrange
The arrange() function arranges the rows based upon the specified condition. For example, if we want to arrange the 'Inv_inc_ratio' variable in an order, we can do that using the lines of code below.
dat_ratio_2 = dat_ratio %>% arrange(Inv_inc_ratio)
head(dat_ratio_2$Inv_inc_ratio)
Output:
1] 20 20 20 20 20 20
By default, the 'arrange()' function orders the variable in ascending order. In order to sort and arrange in descending order, we add the function desc() to the code.
dat_ratio_2 = dat_ratio %>% arrange(desc(Inv_inc_ratio))
head(dat_ratio_2$Inv_inc_ratio)
Output:
1] 70 70 70 70 70 70
Multiple columns can also be included in the 'arrange()' function. In such a case, each column is used to break ties in the values of the preceding columns.
dat_4 = dat_ratio %>% select(approval_status, Age, Inv_inc_ratio)
dim(dat_4)
dat_ratio_4 = dat_4 %>% arrange(approval_status, Age, Inv_inc_ratio)
head(dat_ratio_4)
Output:
1] 600 3
A tibble: 6 x 3
approval_status Age Inv_inc_ratio
<fct> <int> <dbl>
No 22 70
No 23 50
No 23 70
No 23 70
No 24 70
No 24 70
Summarize
The summarize() function summarizes the variables in the dataframe. For example, if we want to get the mean value of variables like income, loan amount, or age, the line of code below will produce and display the desired output.
dat_ratio %>%
summarize(avg_income = mean(Income,na.rm=TRUE),
avg_loan = mean(Loan_amount,na.rm=TRUE),
avg_ratio = mean(Age,na.rm=TRUE))
Output:
A tibble: 1 x 3
avg_income avg_loan avg_ratio
<dbl> <dbl> <dbl>
658615 1455120 48.7
Group_by
The group_by() function groups the data based on one or more columns and then manipulates the grouped dataframe. The 'group_by()' function is often used along with the other five 'dplyr' commands discussed in the previous sections.
We will take one case of using the 'group_by()' and 'summarize()' commands together. The line of code below groups the data by the variable 'Purpose', and then calculates the summary of average income for each 'Purpose' group.
data %>%
group_by(Purpose) %>%
summarise(mean_inc = mean(Income))
Output:
A tibble: 5 x 2
Purpose mean_inc
<chr> <dbl>
Education 677996.
Home 581834.
Others 749483.
Personal 682385.
Travel 640854.
Count
The count() function counts the number of observations in a variable or the dataset. The first line of code below prints the count of the entire dataset, 'dat'. The second line extends this functionality further and provides the count of respective categories within the 'Purpose' variable.
dat %>% count()
dat %>% count(Purpose)
Output:
1 600
A tibble: 5 x 2
Purpose n
<fct> <int>
Education 99
Home 134
Others 69
Personal 178
Travel 120
We can also sort the above output in descending order, as shown in the line of code below.
dat %>% count(Purpose, sort = TRUE)
Output:
A tibble: 5 x 2
Purpose n
<fct> <int>
Personal 178
Home 134
Travel 120
Education 99
Others 69
Rename
The rename() function is used to rename one or more columns. If we want to change the name of the variable 'Inv_inc_ratio' and rename it 'investment_ratio', we can do that using the line of code below. The general syntax is ***rename(new name = old name)***.
newdat = dat_ratio %>% rename(investment_ratio = Inv_inc_ratio)
names(newdat)
Output:
1] "Marital_status" "Is_graduate" "Income" "Loan_amount"
[5] "Credit_score" "approval_status" "Age" "Sex"
[9] "Investment" "Purpose" "investment_ratio"
Conclusion
In this guide, you have learned about data manipulation techniques using the popular 'dplyr' library. These techniques will help you handle complex, real-world data sets.
To learn more about data science using 'R', please refer to the following guides: