Importing Data from a Database Using Python
Feb 5, 2019 • 9 Minute Read
Introduction
In previous guides, we have explained how to import data from Excel spreadsheets, tab-delimited files, and online APIs. As helpful as those resources are, when the data we need to handle grows large or complex, a database (relational or otherwise) may be a better solution. Regardless of the flavor you choose, there are general principles and basic tools that we can use to query a database and manipulate the results using Python.
Prerequisites
To begin, we need to install the appropriate connector (also known as a driver) for the database system that we are using. This utility comes in the form of a module that is available either in the standard library (such as sqlite3) or as a third-party package, such as mysql-connector-python for MySQL / MariaDB and psycopg2-binary for PostgreSQL. In any event, the Python Package Index is our go-to place to search for available adapters.
In this guide, we will use PostgreSQL, since it provides a function called ROW_TO_JSON out of the box. As its name suggests, this function returns each row in the result set as a JSON object. Since we have already learned how to work with JSON data, we will be able to manipulate the result of a query very easily.
If we use a different database system, we can iterate over the results and create a list of dictionaries where each element is a record in the result set.
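Since cursor.description is part of the DB-API 2.0 specification, this approach works with any compliant driver. Here is a minimal sketch using the standard-library sqlite3 module and a throwaway in-memory table (the table and its data are made up for illustration):

```python
import sqlite3

# cursor.description holds one sequence per result column; the first
# item of each sequence is the column name. Zipping the names with
# each row yields the same list-of-dicts shape that ROW_TO_JSON gives
# us in PostgreSQL.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE teams (team_id INTEGER, name TEXT)')
cursor.executemany('INSERT INTO teams VALUES (?, ?)',
                   [(1, 'Raptors'), (2, 'Bucks')])
cursor.execute('SELECT team_id, name FROM teams ORDER BY team_id')

columns = [col[0] for col in cursor.description]
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

print(records)
# → [{'team_id': 1, 'name': 'Raptors'}, {'team_id': 2, 'name': 'Bucks'}]
```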
That being said, we will need to install psycopg2-binary, preferably inside a virtual environment before we proceed:
pip install psycopg2-binary
Now let's examine the PostgreSQL database we will work with, which is called nba. Figs. 1 through 3 show the structure of the coaches, players, and teams tables.
- coaches stores the coach's first and last names, with coach_id acting as the primary key. There is also a team_id foreign key that references the field of the same name in the teams table.
- players, besides player_id (primary key) and team_id (a foreign key indicating the team the player currently plays for), also holds the first and last names, the jersey number, the height in meters, the weight in kilograms, and the country of origin.
- Finally, teams are described by their name, conference, current conference rank, home wins and losses, and away wins and losses. Of course, it also has the team_id primary key that is referenced in the other two tables.
The next step consists in writing a SQL query to retrieve the list of teams ordered by conference and rank, along with the number of players in each team and the name of its coach. And while we're at it, we can also add the number of home and away wins:
SELECT
    t.name,
    t.city,
    t.conference,
    t.conference_rank,
    COUNT(p.player_id) AS number_of_players,
    CONCAT(c.first_name, ' ', c.last_name) AS coach,
    t.home_wins,
    t.away_wins
FROM teams t
    JOIN players p ON p.team_id = t.team_id
    JOIN coaches c ON c.team_id = t.team_id
GROUP BY t.name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins, c.first_name, c.last_name
ORDER BY t.conference, t.conference_rank
We will then wrap the query inside the ROW_TO_JSON function for our convenience and save it to a file named query.sql in the current directory:
SELECT ROW_TO_JSON(team_info) FROM (
    SELECT
        t.name,
        t.city,
        t.conference,
        t.conference_rank,
        COUNT(p.player_id) AS number_of_players,
        CONCAT(c.first_name, ' ', c.last_name) AS coach,
        t.home_wins,
        t.away_wins
    FROM teams t
        JOIN players p ON p.team_id = t.team_id
        JOIN coaches c ON c.team_id = t.team_id
    GROUP BY t.name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins, c.first_name, c.last_name
    ORDER BY t.conference, t.conference_rank
) AS team_info
Fig. 4 shows the first records of the above query. Note that each row has the structure of a Python dictionary where the names of the fields returned by the query are the keys.
Last, but not least, a word of caution. To connect to a database, we need a username and a password. It is best practice to use environment variables instead of exposing them in plain sight as part of the connection string. This is particularly important if you push your code to a version control system that other people can access. In Unix-like environments, this can be done by appending the following two lines at the end of your shell's initialization file. To apply changes, you will need to log out and log back in or source the file in the current session.
export DB_USER="your_PostgreSQL_username_here_inside_quotes"
export DB_PASS="your_password_inside_quotes"
In Windows, open Control Panel, go to System, and click Advanced system settings. Select the Advanced tab and click on Environment Variables to add them.
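In the Python code, reading os.environ['DB_USER'] raises a bare KeyError when the variable is unset. A small helper (the function name here is our own, not part of any library) can fail with a friendlier message instead:

```python
import os

def get_credential(name):
    # Hypothetical helper: read a required environment variable
    # and fail with a clear message when it is missing.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError('Environment variable {} is not set'.format(name))
    return value
```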
We are now ready to start writing Python code!
Querying the Database and Manipulating Results
At the top of our program we will import the necessary modules: os to read the environment variables, sys to report unexpected errors, psycopg2 itself, and the openpyxl classes we will need later to build the spreadsheet:
import os
import sys
import psycopg2 as p
from openpyxl import Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo
Next, we will load the contents of query.sql into query and instantiate the connection. You can also use environment variables for host, port, and database just like we did for user and password, although it is not strictly necessary to do so.
with open('query.sql') as sql:
    query = sql.read()

conn = p.connect(
    user=os.environ['DB_USER'],
    password=os.environ['DB_PASS'],
    host='localhost',
    port='5432',
    database='nba'
)
Once we have successfully connected to the database, it is time to execute the query. To do so, we use a cursor, a control structure associated with the connection. If everything went as expected, the result variable will contain a list of one-element tuples where each element is a dictionary.
cursor = conn.cursor()
cursor.execute(query)
result = cursor.fetchall()
At this point, we can iterate over result and manipulate its contents as desired. For example, we may insert them into a spreadsheet (as illustrated in Fig. 5), as we learned in Importing Data from Microsoft Excel Files with Python, or use them to feed an HTML table via a web application.
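Because each fetched row is a one-element tuple whose single item is a dictionary, tuple unpacking in the loop header keeps the iteration tidy. The sketch below uses a hand-made result with the same shape; the values are invented for illustration:

```python
# Sample data mimicking the shape of cursor.fetchall() for the
# ROW_TO_JSON query: a list of one-element tuples of dicts.
result = [
    ({'name': 'Bucks', 'city': 'Milwaukee', 'conference': 'East',
      'conference_rank': 1, 'number_of_players': 17,
      'coach': 'Mike Budenholzer', 'home_wins': 25, 'away_wins': 18},),
]

lines = []
for (team,) in result:
    # The dict keys match the column aliases in the query, so they
    # can be used directly as format placeholders.
    lines.append('{name} ({conference} #{conference_rank}), coached by {coach}'.format(**team))

print(lines[0])
# → Bucks (East #1), coached by Mike Budenholzer
```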
To catch any errors that may occur, we wrap our code inside a try-except block. And while we are at it, adding a finally clause allows us to clean up the connection when we are done using it:
conn = None
cursor = None

try:
    # Instantiate the connection
    conn = p.connect(
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASS'],
        host='localhost',
        port='5432',
        database='nba'
    )
    # Create the cursor, execute the query, and fetch the results
    cursor = conn.cursor()
    cursor.execute(query)
    result = cursor.fetchall()
    # Create the workbook and select the active sheet
    wb = Workbook()
    ws = wb.active
    # Rename the active sheet
    ws.title = 'Teams'
    # Column headings
    column_headings = [
        'Name',
        'City',
        'Conference',
        'Rank',
        'Players',
        'Coach',
        'Home wins',
        'Away wins'
    ]
    ws.append(column_headings)
    # Add one row per team
    for team in result:
        ws.append(list(team[0].values()))
    # Get the coordinates of the last cell
    last_cell = ws.cell(row=ws.max_row, column=ws.max_column).coordinate
    # Create the table
    team_table = Table(displayName='TeamTable', ref='A1:{}'.format(last_cell))
    # Add the 'Table Style Medium 6' style
    style = TableStyleInfo(name='TableStyleMedium6', showRowStripes=True)
    # Apply the style to the table
    team_table.tableStyleInfo = style
    # Add the table to the spreadsheet
    ws.add_table(team_table)
    # Save the spreadsheet
    wb.save('teams.xlsx')
except p.Error as error:
    print('There was an error with the database operation: {}'.format(error))
except Exception:
    print('There was an unexpected error of type {}'.format(sys.exc_info()[0]))
finally:
    # Close the cursor and connection only if they were opened
    if cursor:
        cursor.close()
    if conn:
        conn.close()
Both the script and the SQL file are available on GitHub. Feel free to use and modify them as you need.
Summary
In this guide, we have learned how to connect to a PostgreSQL database using Python, execute queries, and import the results into an Excel spreadsheet. However, you can adapt the code to do other things, such as creating and saving a JSON file or populating an HTML table.