Data Manipulation with Dplyr
For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.
Oct 23, 2020 • 8 Minute Read
Introduction
When working with data, it's important to know what you want to do with the substantial amount of information you have. To figure out the facts from the data, some level of manipulation is necessary, as it is rare to get the data in exactly the right form.
For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.
# The easiest way to install dplyr is to install the whole tidyverse
install.packages("tidyverse")
# Alternate way to install only dplyr
install.packages("dplyr")
Overview
of dplyr
dplyr provides a consistent set of functions to solve data manipulation problems. Some of these include:
- filter(): to select records based on their values
- arrange(): to reorder
- select(): to select variables from the dataset
- mutate(): to add new variables
- summarize(): to condense multiple values into one
- group_by(): to break down the dataset into specified groups of rows
Piping Operator
Every data manipulation activity would not simple that will use one or two functions. To make the code easy to read, dplyr uses %>% (the piping operator) from magrittr, which turns x%>%f(x) into f(x, y). We'll be using this operator to help explain dplyr.
Manipulating Data
To explore the functions of dplyr, we need a dataset. We will use the flights dataset from the nycflights13 package, which contains several useful datasets.
# Installing nycflights13 package
install.packages("nycflights13")
library(nycflights13)
data(flights)
# looking into sample data
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
# ... with 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Filtering rows
To get a subset of rows from the main dataset, use the filter() function. This function takes dataframe as the first argument, and subsequent arguments are the expression that filters the dataframe.
# loading dplyr library
library(dplyr)
# filtering flights dataframe on year
filter(flights, year ==2013)
# adding more expression for year and month
filter(flights, year = 2013, month = 1)
The filter expressions are applied as an AND operation.
Arranging Rows
The arrange() function works similarly to filter() except it arranges the rows of the dataframe. The first argument of the function is the dataframe name and the subsequent arguments are the column names.
# arranging the flights dataframe on year
arrange(flights, year)
# arranging dataframe on multiple columns
arrange(flights, year, month, day)
# The default order of arrange() function is ascending if we want, we can arrange in descending order
arrange(flights, desc(year))
When providing multiple columns, each column breaks the arrangement of the rows of the preceding one.
Selecting Columns
If you are interested only in a few columns of the dataset, pull them using the select() function. The first argument ise the name of the dataframe and subsequent arguments are the names of the columns or expressions.
# selecting columns through their names
select(flights, year, month, day)
# selecting columns with expressions
select(flights, starts_with("arr"))
Adding New Columns
To add a new column that is a function of the existing columns in the dataframe, use the mutate() function. The first argument of the mutate function is the name of the dataframe and the subsequent arguments are the formula for the new columns.
# The flights dataset has distance and air_time, so we will add speed in the dataframe
mutate(flights, speed = distance/air_time)
This will create a new column, speed, in the flights dataframe.
Summarizing Values
Summarized data helps make decisions that are difficult to decide properly going through a huge amount of information. The summarize function helps in this scenario.
# lets get mean of the delay time in arrival from flights dataframe.
summarize(flights, delay = mean(arr_delay, na.rm = TRUE)
The summarize() function is used a lot with the group_by() function as it gives more detailed information when used along with group_by().
Grouped Operations
In grouped operations, the dataset breaks down into specified groups of rows. In dplyr, this is done with the group_by() function. The first argument of the function is the data frame name and subsequent arguments are those columns that take part in grouping the rows. We generally use the group_by() function along with some aggregate functions.
# Grouping dataset in years
group_by(flights, year)
# Using along with summarize() function
summarize(group_by(flight, year), delay = mean(arr_delay, na.rm = TRUE)
When used in the summarize() function, the result is the mean of arr_delay for each year.
Piping
So far, we have applyied the data manipulation function alone, but in this section, we will take on a task in which we have to use multiple functions and club them with the help of the piping(%>%) operator.
Say you need to calculate the mean delay in arrival and departure for every month of the year 2013 from the flights dataset. If you read the statement, it looks complicated. So you will break it in the code section.
# Using functions along with %>% operator
flight%>%
filter(year = 2013)%>% # First filter the required rows
select(year, month, arr_delay, dep_delay)%>% # Second selecting the necessary columns
group_by(year, month)%>% # Third grouping the rows
summarise( arrival = mean(arr_delay, na.rm = TRUE),
departure = mean(dep_delay, na.rm = TRUE)) # Fourth now calculating the columns
The code above uses the %>% operator to give you a better understanding of the code.
Conclusion
Whenever we are working with data, we need to do some manipulations to get the most valuable information. In real-life scenarios, datasets are more complicated and contain a lot of errors, so we have to write code that can manipulates data efficiently and tackles the errors.
Also, the volume of data in real-life cases is much higher. We can handle this if we understand the task and break it into small functions.
You can get more information about dplyr functions here.