Importing Data from Microsoft Excel Files with Python
Nov 26, 2018 • 7 Minute Read
Introduction
By definition, data can be any alphanumeric value that represents a property of something. It does not, by itself, have any semantic meaning. Thus, it must be put into context to generate information - which, in turn, can drive decision making.
As the de-facto standard spreadsheet application, Microsoft Excel (or Excel for short) allows users to organize and process data into tables and charts. If combined with the bounties of object-oriented programming, business or data science professionals may use it to develop robust analysis tools.
In this guide, we will learn how to write to and read from a workbook using Python and the openpyxl module. Although there are others that provide similar functionality, this module includes both reading and writing capabilities, is simple to use, and is a free software released under the MIT license.
Writing to an Excel File
Before we proceed, you will need to install openpyxl using pip from the command line, preferably inside a virtual environment:
pip install openpyxl
Once complete, launch a Python shell and type the following:
from openpyxl import Workbook
Next, create a new Workbook instance:
wb = Workbook()
and save it in the current working directory as players.xlsx:
wb.save('players.xlsx')
So far, so good - but all we have done is create an empty spreadsheet, as we can see in Fig. 1:
Let's now explain, step by step, how to write data to the file.
- Import modules and classes:
from openpyxl import load_workbook, Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo
- Define a function called str_to_int_or_float to convert strings to integers or floats if possible (booleans and plain strings are left the same). This will later allow us to insert numbers to the spreadsheet in the proper format.
def str_to_int_or_float(value):
if isinstance(value, bool):
return value
try:
return int(value)
except ValueError:
try:
return float(value)
except ValueError:
return value
- Load workbook, select active sheet and rename it:
wb = Workbook()
ws = wb.active
ws.title = 'Players info'
- Add the headings separately and then use nba_players as datasource:
ws.append(['First name', 'Last name', 'Jersey', 'Height [mts]', 'NBA debut year', 'Weight [kgs]'])
nba_players = [{'firstName': 'LeBron', 'lastName': 'James', 'jersey': '2', 'heightMeters': '2.03', 'nbaDebutYear': '2003', 'weightKilograms': '113.4'}, {'firstName': 'LaMarcus', 'lastName': 'Aldridge', 'jersey': '12', 'heightMeters': '2.11', 'nbaDebutYear': '2006', 'weightKilograms': '117.9'}, {'firstName': 'Kawhi', 'lastName': 'Leonard', 'jersey': '2', 'heightMeters': '2.01', 'nbaDebutYear': '2011', 'weightKilograms': '104.3'}, {'firstName': 'Jabari', 'lastName': 'Parker', 'jersey': '2', 'heightMeters': '2.03', 'nbaDebutYear': '2014', 'weightKilograms': '111.1'}]
Since each element in nba_players is a dictionary, we can access its values with the .values() method. Next, we run str_to_int_or_float on each value using the Python built-in map function. Finally, we convert the result to a list.
for player in nba_players:
ws.append(list(map(str_to_int_or_float, player.values())))
Fig. 2 illustrates this step using one player:
Although this may seem complicated at first, it becomes second nature after a bit of practice.
- Create an Excel table that starts at A1 and ends on the last non-empty cell:
last_cell = ws.cell(row = ws.max_row, column = ws.max_column).coordinate
player_table = Table(displayName = 'PlayerTable', ref = 'A1:{}'.format(last_cell))
- Style the table and add it to the spreadsheet. Note that you can use any of the table styles available in your Excel version (we will use Table Style Medium 6 here) without spaces in its name. Finally, save changes.
style = TableStyleInfo(name = 'TableStyleMedium6', showRowStripes=True)
player_table.tableStyleInfo = style
ws.add_table(player_table)
wb.save('players.xlsx')
Let's see in Fig. 3 what the spreadsheet should look like after resizing the columns for better visualization:
As you can see, filters were added automatically on the table for your convenience. At this point you can create charts based on this table or manipulate it as you wish - either through Excel or programmatically, as we will learn next.
Reading from an Excel File
Using openpyxl, you can also read from an existing spreadsheet. Among other things, this makes it possible to perform calculations and add content that was not part of the original dataset.
To illustrate, we will read each row and calculate the body mass index (BMI) of each player
BMI = weight / height^2
following these steps:
- Load the spreadsheet, select the Players info sheet, and choose a title for cell G1:
wb = load_workbook(filename = 'players.xlsx')
ws = wb['Players info']
ws['G1'] = 'BMI'
- Iterate over the table, beginning at row 2. Keep in mind that the given end point of the Python built-in range function is not part of the generated list. That is why we need to add 1 to ws.max_row to reach the last row.
For each row, the weight (in kilograms) is stored on the 6th column, so we use the .cell() method of the worksheet object to retrieve its value. Next, we do the same with the height (4th column). This gives us what we need to calculate the BMI - which, in this case, we are rounding to 2 decimal places. Finally, we write the result on the 7th column:
for r in range(2, ws.max_row + 1):
weight = ws.cell(row = r, column = 6).value
height = ws.cell(row = r, column = 4).value
bmi = round(weight / (height ** 2), 2)
ws.cell(row = r, column = 7).value = bmi
The .cell() method can be used to set or read the value of a cell, provided that its coordinates are known.
And don't forget to save changes:
wb.save('players.xlsx')
Here's our spreadsheet so far (see Fig. 4):
You can now extend the table to include column G by dragging the blue marker in cell F5 over to the right. This will set the same format in the new column as in the rest (refer to Fig. 5, where the blue marker is in G5 now):
As you can see, all it takes to read an Excel file is knowing how to iterate over its rows. The .cell() method takes care of everything else.
Summary
The key takeaway from this guide is that reading and writing to Excel spreadsheets using Python is a walk in the park. Even more important, we learned how to process existing data to generate information.