Exploring your First Data Set with R
Jun 11, 2019 • 10 Minute Read
Introduction
Of all the data brass tacks in R, this is the brassiest. Or the tackiest. When you're first digging into a data set, where do you start? Great question. In this guide, we'll imagine that you've already read in your data set, have some column names, and don't have a ton of prior knowledge or business context. We do, however, assume that you have the patience to read plots and aren't scared of basic stats.
We'll work with the awesome nycflights13 data set and the tidyverse, a collection of packages that makes R data work easier and more robust. Here, we'll show one way to do some basic exploratory data analysis (EDA) centered around improving on-time flight performance. As is common with EDA, we'll focus on hypothesis generation. For maximum learning, follow along with a data set that you are excited about.
For those who are curious, we'll build this via RStudio and R Markdown.
If following along, first run install.packages(c('nycflights13', 'tidyverse', 'skimr'))
library(tidyverse)
library(nycflights13)
library(skimr)
Data Format and Types
When first working with new data in the tidyverse, it's great to make sure you're working with a tibble (which is an especially responsible kind of data frame). If you're unsure, pass your data frame to as_tibble(), like this.
d <- as_tibble(flights)
As a first sanity check, look at your data (see below). Do the numbers of rows (Observations) and columns (Variables) look right? What data type is each column using (note the <int>), and do those types make sense? If you see a continuous column encoded as <chr> (i.e., a character), you have trouble; the fix is a mutate() with the appropriate as.* conversion, as shown below.
glimpse(d)
Note that these look broadly correct: the non-numeric columns are encoded as <chr> and the time_hour column is encoded as <dttm>, which is great. The flight (flight number) column, however, is encoded as an <int> when it's really a discrete identifier rather than a quantity, so let's convert it.
d <- d %>%
  mutate(flight = as.character(flight))
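For the opposite problem mentioned earlier (a continuous column that came in as <chr>), the same mutate() pattern applies with as.numeric(). Here's a minimal toy sketch (made-up data, not the flights table):

# Toy example: a numeric column that was read in as character
toy <- tibble(delay_chr = c("5", "12", "-3"))
toy <- toy %>%
  mutate(delay_num = as.numeric(delay_chr))
glimpse(toy)

If as.numeric() produces NAs here, the column contains values that aren't cleanly numeric and deserves a closer look.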
Before digging into the analysis itself, the next order of business is making sure that the data looks broadly correct. Many data pipelines go unwatched for extended periods, and issues creep in. Let's validate that the data is worth our time.
Missingness
Whenever starting work with a new data set, it's imperative to know how much of it is actually there. Are significant parts of certain columns simply missing? Depending on your business, missing data could be coded as NA, 99999, '', etc. This is why it's helpful to simply look at your data! For our example here, let's say that missing values are coded as NA. Let's use base R to find what percent of each column is NA.
colMeans(is.na(d))
Note that a few columns have 2% or more of their data missing.
- If you're going to be relying on this data set as part of a large project/pipeline, it'd be worth finding out why certain columns are missing significant chunks of data.
- Depending on the use case, the strategy for how to deal with this missingness will vary. If you're doing ML, for example, there are entire packages dedicated to this.
With our business problem centered around on-time performance, it's concerning that arr_time, arr_delay, and dep_delay are missing data. Since this isn't a ton of missingness for exploratory work, we'll choose to ignore it for now. If we find it impacting the analysis, we'll revisit.
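As mentioned above, not every data set codes missing values as NA. If yours uses sentinel values such as 99999 or empty strings, one option is to recode them with na_if() before counting. A small toy sketch (made-up data, not the flights table):

# Toy example: recode sentinel values to NA so that is.na() counts them
toy <- tibble(delay = c(5, 99999, 12), gate = c("A1", "", "B7"))
toy %>%
  mutate(delay = na_if(delay, 99999),
         gate  = na_if(gate, "")) %>%
  is.na() %>%
  colMeans()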
Note: there are a few types of missing data (e.g., missing completely at random vs. missing not at random), and the right way to handle each differs.
Summary Statistics
For the data that is present, we need to confirm that it's what we'd roughly expect. Does our prior understanding of the business broadly match the data? If not, it's a good opportunity to update our mental model or fix the data pipeline. How do we check? There are myriad ways, but an easy way is via the skimr package. Note that this also shows missingness, and can take care of many early data set exploration needs.
d %>%
  skimr::skim()
Okay, so what looks odd in those statistics? Well, the median (p50) arrival delay is -5 minutes, which seems a bit odd until you remember that airlines build large buffers into their schedules so that their on-time numbers look good (i.e., technically, most flights arrive early). Also note that the mean arrival time (arr_time) is 1502 (i.e., about 3:02 PM), which sounds about right: it's notoriously difficult to arrive anywhere in the morning.
Notice that the year column has an sd (standard deviation) of 0, which means every row is from 2013. Since it carries no information, let's remove that column.
d <- d %>%
  select(-year)
Now, summary statistics are helpful, but they can also hide anomalies; see Anscombe's quartet. It's fantastic that skim() provides inline histograms, but even those can hide issues.
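If you want to see this for yourself, the anscombe data set ships with base R; all four x/y pairs have nearly identical correlations despite looking wildly different when plotted:

# Correlations of the four x/y pairs in Anscombe's quartet are nearly identical (~0.816)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))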
Visualization
To fully appreciate your data, you must plot it (often in many ways). Again, here we do this to generate hypotheses around on-time performance. We're not going to build a model in this guide, but finding some potential drivers of on-time performance would be a good start.
Numerical Drivers of Flight Delays
In any data set, you have to split your visualization work between your numerical and categorical variables (since the plot types differ). The scatter plot is an awesome tool for examining the relationship between two continuous variables. Here we'll plot several of these to examine potential drivers of arrival delays. We'll use the ggplot2 and tidyr packages, which are part of the tidyverse.
d %>%
  select_if(is.numeric) %>%
  tidyr::gather(-arr_delay, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = arr_delay)) +
  facet_wrap(~ var, scales = "free") +
  geom_point() +
  stat_smooth()
Note that the blue line is the smoothed trend. While the relationships aren't as clear as we might like, we can see that arrival delay (on the y-axis) has a relationship with dep_time (departure time) - it looks like flights reset around 5AM and delays become more common later in the day. This is seen in several of the other variables as well (since they relate to dep_time). Of course, departure delays are very closely related to arrival delays (especially for longer departure delays).
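One quick way to put numbers behind that "later in the day" pattern (a small sketch, not part of the plot above) is to summarize arrival delay by scheduled departure hour:

# Mean arrival delay by scheduled departure hour (dropping missing delays)
d %>%
  group_by(hour) %>%
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(hour)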
Categorical Drivers of Delay
Now let's look at the categorical columns and determine how they relate to arrival delays (arr_delay). Note that it's difficult to visualize columns with many categories. Here we look at the character-variable section near the top of the skim() results above and grab the variables with fewer than 20 unique values (shown in the n_unique column). That includes carrier (i.e., airline) and origin (i.e., airport of flight origin).
d %>%
  # Only grab columns with a few categories and numerical col of interest
  select(carrier, origin, arr_delay) %>%
  gather(-arr_delay, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = arr_delay)) +
  facet_wrap(~ var, scales = "free") +
  geom_boxplot()
Here we see that there are notable differences in arrival delays by both airline and origin airport. Good work HA (Hawaiian Airlines)! If you work in aviation, such plots could generate several hypotheses about where your data work should go next.
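To back up the boxplot reading with numbers (again, a small optional sketch), you can compare median arrival delay by carrier:

# Median arrival delay by carrier, sorted from best to worst
d %>%
  group_by(carrier) %>%
  summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
  arrange(median_arr_delay)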
Note, if you want to leverage categorical columns with more than ~20 unique values, a different route is required. Let's grab flights that are more than 120 min late.
d_late <- d %>%
  filter(arr_delay > 120)
Now let's analyze the data in terms of contingency tables, which present the proportion (percentage) of observations in each combination of categories. These let you incorporate columns with lots of categories (like dest, the destination) to determine where delays happen most frequently.
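One way to build such a table (a sketch; the exact columns you cross-tabulate will depend on your question) is to pass the categorical columns to base R's table() and then prop.table():

# Cross-tabulate very-late flights by origin, destination, and carrier,
# then convert counts to proportions of all very-late flights
d_late %>%
  select(origin, dest, carrier) %>%
  table() %>%
  prop.table() %>%
  round(4)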
Ah, it looks like flights from EWR to ATL on DL (Delta) and EV (ExpressJet) are particularly prone to delays of more than 120 minutes. What's happening there? Note that this table doesn't spoon-feed the analysis; the other airlines shown may not even fly from EWR to ATL.
It's left as an exercise to build the contingency table that best fits your needs.
Note that this code can produce a ton of output, and we've only shown a snippet. Of course, you can control this through which variables you include in your table() call above.
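For example, restricting the table to just origin and carrier keeps the output manageable (again, just one possible sketch):

# A smaller contingency table: origin by carrier only
d_late %>%
  select(origin, carrier) %>%
  table() %>%
  prop.table() %>%
  round(3)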
Conclusion
Now that we've come up with some hypotheses, next would come the modeling, experimentation, and overall hypothesis-testing phase. That deserves its own guide.
Even without much business context, you can quickly explore new data sets using the above code and generate a number of new hypotheses. Since we focused on a tabular data format and used both categorical and numerical columns, this should allow you to interrogate ~99% of business data sets. Happy computing!