Importing Data from Relational Databases in R
Jan 13, 2020 • 7 Minute Read
Introduction
R is great for statistics. You need data—lots of data—for better statistical results. Relational databases are great at storing lots of data.
As the saying goes, use the right tool for the right job.
If your job is to analyze lots of data, then it makes much sense to import data, get data, grab data, fetch data ... whatever you prefer to call it, just let your R script use the good stuff from that database!
How Do You Connect to the Database?
To import data into your R script, you must connect to the database. Fortunately, there are dedicated R packages for connecting to most popular databases out there. Even more good news: such packages have simple names.
- RSQLite for connecting to a SQLite database
- RPostgreSQL for connecting to a PostgreSQL database
I like this convention: just put R in front of the database software name, cross your fingers, and you are good to go.
Let's connect to a SQLite database from the R interpreter. Which package do we need for that?
> install.packages('RSQLite')
> library(RSQLite)
> con <- dbConnect(SQLite(), 'play-example.db')
> con
<SQLiteConnection>
Path: C:\Users\dan\Documents\play-example.db
Extensions: TRUE
>
The first line installs the RSQLite package (I skipped the boring installation message). The second line loads the RSQLite package, as expected. The connection magic happens on the third line by calling the dbConnect() function with two arguments:
- The first argument is the SQLite() function, which creates a driver object for SQLite under the hood.
- The second argument is the name of the database. For SQLite, it's the actual file name, and it gets created if the file is not already in place.
It makes sense to store the connection into a variable. The newly created connection shows the file path of the SQLite database. Of course, expect to get a different path if you try it on your machine. By the way, can you please check the size of the SQLite file on your machine? What is the explanation?
Show Me the Tables
We can expect a newly created SQLite database to be empty. Let's use some R code to check and confirm that.
> dbListTables(con)
character(0)
>
Ok, no tables yet, but at least the dbListTables() function has a nice, intuitive name. Let's create a new table with the classic mtcars data frame.
> dbWriteTable(con, 'cars', mtcars)
> dbListTables(con)
[1] "cars"
>
Excellent! Now we can play with the cars table.
Show Me the Data with SQL
Let's use some SQL to get 3 rows from the cars table:
> dbGetQuery(con, 'SELECT * FROM cars LIMIT 3')
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
>
How can we get mpg and number of cylinders for only cars with mpg higher than 30, ordered by mpg in ascending order?
>> dbGetQuery(con, 'SELECT mpg, cyl
FROM cars
WHERE mpg>30
ORDER BY mpg')
mpg cyl
1 30.4 4
2 30.4 4
3 32.4 4
4 33.9 4
>
Of course, you can use more sophisticated queries to refine results.
Show Me the Data with dplyr
If you are not in the mood to write SQL queries, then use dplyr to generate SQL code automatically for you. The dplyr package is great for working with data frames. Together with dbplyr, it enables you to work with a table as if it were a typical data frame.
Let's install prerequisites and connect to the database:
> install.packages(c('dplyr', 'dbplyr'))
> library(dplyr)
> library(RSQLite)
> con <- dbConnect(SQLite(), 'play-example.db')
> cars <- tbl(con, 'cars')
>
Look at line 4: we use dbConnect() again. Pay attention to the first argument: it's SQLite(), not just SQLite. That's very easy to miss and it happened to me a few times. On line 5, the tbl() function takes care of returning data from the cars table through the connection.
Let's look at how dplyr can help us:
> cars %>% select(mpg)
# Source: lazy query [?? x 1]
# Database: sqlite 3.30.1 [C:\Users\dan\Documents\play-example.db]
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# … with more rows
>
What is this?
Well, %>% is about piping data from left to right, from cars to the select() function that returns the mpg column. Read it as then for convenience: cars, then select mpg, and so on.
Let's try another one.
Again, how can we get mpg and number of cylinders for only cars with mpg higher than 30, ordered by mpg in ascending order?
> cars %>%
select(mpg, cyl) %>%
filter(mpg>30) %>%
arrange(mpg)
# Source: lazy query [?? x 2]
# Database: sqlite 3.30.1 [C:\Users\dan\Documents\play-example.db]
# Ordered by: mpg
mpg cyl
<dbl> <dbl>
1 30.4 4
2 30.4 4
3 32.4 4
4 33.9 4
>
Look at the first line: cars, then select mpg and cyl, then filter by mpg larger than 30, then arrange by mpg. Sounds quite natural, right? Also, results are the same as using the SQL query.
By the way, did I mention that dplyr generates SQL automatically? Let's use the show_query() function to look at the generated SQL.
> cars %>%
select(mpg, cyl) %>%
filter(mpg>30) %>%
arrange(mpg) %>%
show_query()
<SQL>
SELECT *
FROM (SELECT `mpg`, `cyl`
FROM `cars`)
WHERE (`mpg` > 30.0)
ORDER BY `mpg`
>
This is very similar to the SQL query we wrote painstakingly in the previous section. It's great to have both options: manual SQL and dplyr!
Conclusion
Once connected to the database, you can import data using either SQL or dplyr and use that data further in your R code.
In this guide, we used SQLite to illustrate the main points of importing data from a relational database. These points are also applicable for working with other relational databases.
If you want to delve deeper into importing data from databases into your R code, then have a look at my Pluralsight course Importing Data from Relational Databases in R, which covers topics such as how to connect to the database with ODBC and how to secure credentials or troubleshoot connection issues.