- Lab
- A Cloud Guru
Using Pandas Data Frames and Pivot Tables in Python
Pandas is a very popular library for performing data analysis with Python. Two common features that you'll use when working with pandas are data frames and pivot tables. In this hands-on lab, you'll go through the process of using data frames and pivot tables to analyze some employee information and provide some insight to others on your team. *Warning:* This is a lab designed as part of a professional level course and is difficult. The lab asks you to accomplish something using exact methods and functionality of the `pandas` library that might not have been covered in lessons. To feel comfortable completing this lab, you'll want to know how to do the following: * Use pandas data frames. Watch the "Creating and Using Dataframes" video from the [Using Python's Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course. * Create pivot tables. Watch the "Creating Pivot Tables" video from the [Using Python's Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course. * Comfortability reading the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) to find new functions and methods to use to accomplish your goal.
Path Info
Table of Contents
-
Challenge
Install Pandas
Before we can use pandas to analyze our employee data we'll need to make sure that it is installed.
-
Challenge
Create a data frame from the `employee_info` list and determine the number of unique job titles.
Our
pay_analysis.py
script already sets up a list of dictionary objects with our employee information namedemployee_info
. We need to use this list to create apandas.DataFrame
for us to be able to perform our analysis. Get the column names from an item in the list and create a new data frame calledemployee_frame
.Note: We also need to import
pandas
. -
Challenge
Use a pivot table to determine the salary difference between people doing the same job depending on their gender.
To determine the average pay difference between men and women with the same job title, we'll first need to determine which jobs have both men and women doing them. We can do this by grouping our data frame information by the job title first and then creating a
pandas.Series
that has the job titles as the index values and the number of unique genders as the values.Next, using
job_title_unique_gender_count
, we'll take all of the items that have a value of2
, drop the items that will be adjusted to aNaN
(because they aren't a2
) and then get apandas.Index
object with just the names by accessing theindex
of ourpandas.Series
. This seems a little complicated, but it's a good way for us to get just a list of the job titles where the number of genders in that role is 2.After that, we create a new data frame from our
employee_frame
data that only contains employees in a job that includes both genders.The last two things we do are creating our pivot table based on the job title where we have a column for each gender and adding a column to the table to hold the pay difference between female and male employees.
-
Challenge
Use a pivot table to determine the salary differences of people doing the same job depending on their age.
Our final pivot table will work a lot like the
gender_difference_table
, except we need to categorize each of our employees into an age range before we start making our comparisons. To do this, we'll use thepandas.cut
function and add the resulting series as a new column on ouremployee_frame
object, calling itage_range
. There are 2 possible categories for each employee to fall under "18 - 40" and "41 - 80". We're going to also add a third, unused category to the list of categories for this column so that we can add a column to our pivot table later.The data type of the
age_range
column iscategory
, and it is important that when we're adding a category to it that we use theinplace=True
option. If we don't do this, then we'll get a new category returned to us and won't modify theemployee_frame
at all.Next, we need to determine which job titles have at least one employee from each of the age ranges. This code is virtually identical to what we did to determine while job titles had employees from each of the gender options.
Now that we have a new data frame, we're ready to create our pivot table and populate a new 'difference' column. This will look very similar to our other pivot table, except we want to create columns for each of the values in our
age_range
column. Here's the rest of the code that we need to write to create our table and calculate the differences in pay:
What's a lab?
Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
Provided environment for hands-on practice
We will provide the credentials and environment necessary for you to practice right within your browser.
Guided walkthrough
Follow along with the author’s guided walkthrough and build something new in your provided environment!
Did you know?
On average, you retain 75% more of your learning if you get time for practice.