Data Cleaning
May 9, 2019 • 8 Minute Read
Introduction
This guide introduces the concept of data cleaning, which involves identifying incomplete data, deleting coarse data, handling invalid data, and so on. Most real-world data is unstructured and requires reshaping and further preprocessing. Let us walk through a few data cleaning steps using the Pandas library in Python.
Consider a CSV file named student_grade.csv which holds data for nearly 1,500 students across four attributes. A good way to get a quick feel for the data is to look at the first 10 rows of the dataset, as shown below:
| | Student_ID | St_GRADE | Left_School | St_AGE |
|---|---|---|---|---|
| 1 | 2K131 | 7.5 | NO | 12 |
| 2 | 2K152 | 7 | NO | 12 |
| 3 | 2K163 | n/a | YES | 13 |
| 4 | 2K174 | 9 | NO | 13 |
| 5 | 2K185 | 7 | NO | 13.67 |
| 6 | 2K174 | 9 | NO | 13 |
| 7 | 2K106 | 8 | NO | 14.79 |
| 8 | 2K185 | 6.3 | YES | 13 |
| 9 | 2K152 | 6 | NO | 13 |
| 10 | 2K119 | n/a | NO | 12 |
The current dataset is already reshaped, so we define the following four objectives for the data cleaning task:
- Identify the missing grades.
- Remove the records of students who have left the school.
- Identify and drop the duplicate records (repeating entries).
- Identify and correct the invalid age.
The Baseline
To initiate the process, the first step is to import the Pandas library and read the dataset.
```python
import pandas as pd

df = pd.read_csv('student_grade.csv')

# Storing the topmost 10 rows in a new variable df_sub
df_sub = df.head(10)
```
The .head(10) method extracts the first 10 rows of the DataFrame.
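If you don't have the CSV file at hand, the preview can be recreated directly as a small DataFrame. The values below are copied from the table above; this construction is only for following along:

```python
import pandas as pd

# Recreating the 10-row preview from the table above (illustrative only)
df_sub = pd.DataFrame({
    'Student_ID': ['2K131', '2K152', '2K163', '2K174', '2K185',
                   '2K174', '2K106', '2K185', '2K152', '2K119'],
    'St_GRADE': ['7.5', '7', 'n/a', '9', '7', '9', '8', '6.3', '6', 'n/a'],
    'Left_School': ['NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'NO', 'YES', 'NO', 'NO'],
    'St_AGE': [12, 12, 13, 13, 13.67, 13, 14.79, 13, 13, 12],
}, index=range(1, 11))

print(df_sub.shape)  # (10, 4)
```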
Identify the Missing Values
Before we dive into the code, it is important to understand the possible sources of missing data. Here are a few ways missing data can originate:
- The user might have forgotten to fill in the data.
- The user might not have full information about the data to be filled in.
- A system error may have occurred.
In the column St_GRADE we can see the null values written as n/a, but since we have only a few records in the preview, the complete data might contain other forms of missing values too. Let us identify the rows with n/a:
```python
missing_values = 'n/a'

# Storing the matching rows in a separate variable so df_sub stays intact
missing_rows = df_sub[df_sub.St_GRADE == missing_values]
missing_rows
```
Output:
| | Student_ID | St_GRADE | Left_School | St_AGE |
|---|---|---|---|---|
| 3 | 2K163 | n/a | YES | 13 |
| 10 | 2K119 | n/a | NO | 12 |
Most of the time, missing values are not hard-coded by the user. If the user doesn't use a specific placeholder (here, n/a), then Pandas assigns a NaN value to each missing cell. Such cells can be identified and handled as follows:
```python
import numpy as np

dummy = pd.Series([45, np.nan, 12])

# Checking which instances have missing values
dummy.isnull()
# Output:
# 0    False
# 1     True
# 2    False
# dtype: bool

# To fill the missing values, use fillna()
dummy = dummy.fillna(5)
dummy
# Output:
# 0    45.0
# 1     5.0
# 2    12.0
# dtype: float64
```
Notice from the above code that even after filling the integer 5 into the dummy Series, the output still has the float datatype. This is because the Series initially contained NaN, which forces the float64 datatype.
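When a column stores its numbers as strings with a placeholder like n/a, one common approach (a general technique, not specific to this dataset) is to let Pandas coerce anything non-numeric to NaN using pd.to_numeric:

```python
import pandas as pd

# A grade column containing the placeholder string 'n/a'
grades = pd.Series(['7.5', '7', 'n/a', '9'])

# errors='coerce' turns any value that cannot be parsed as a number into NaN
grades = pd.to_numeric(grades, errors='coerce')
print(grades.isnull().sum())  # 1

# The NaN cell can now be filled, e.g. with the mean of the remaining grades
grades = grades.fillna(grades.mean())
```

This also converts the column to a numeric datatype, so calculations like means and comparisons work afterwards.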
Remove the Students’ Records Who Have Left the School
Next, let us remove the records of the students who have left the school; moving forward, such data is not useful and only increases the dataset size. While we are at it, we also drop the rows with the missing grades identified earlier. To achieve this, we subset the DataFrame to keep only the rows where St_GRADE is not n/a and the value in the Left_School column does not equal YES:

```python
# Dropping the rows with missing grades and the records of students who left
df_sub = df_sub[df_sub.St_GRADE != 'n/a']
df_sub = df_sub[df_sub.Left_School != 'YES']
```
You can observe the output after the records are deleted, as shown below:

| | Student_ID | St_GRADE | Left_School | St_AGE |
|---|---|---|---|---|
| 1 | 2K131 | 7.5 | NO | 12 |
| 2 | 2K152 | 7 | NO | 12 |
| 3 | 2K174 | 9 | NO | 13 |
| 4 | 2K185 | 7 | NO | 13.67 |
| 5 | 2K174 | 9 | NO | 13 |
| 6 | 2K106 | 8 | NO | 14.79 |
| 7 | 2K152 | 6 | NO | 13 |
Identify and Drop the Duplicate Records
To identify the duplicate records of the DataFrame, we can use the duplicated() method provided by Pandas.
```python
# Checking the duplicate rows
df_sub["is_duplicate"] = df_sub.duplicated()
```
The above code finds the duplicate instances and tags them with the boolean value True if a row is a duplicate of an earlier one, or False otherwise. The result is stored in a new column named is_duplicate of the DataFrame df_sub, as shown below:
| | Student_ID | St_GRADE | Left_School | St_AGE | is_duplicate |
|---|---|---|---|---|---|
| 1 | 2K131 | 7.5 | NO | 12 | False |
| 2 | 2K152 | 7 | NO | 12 | False |
| 3 | 2K174 | 9 | NO | 13 | False |
| 4 | 2K185 | 7 | NO | 13.67 | False |
| 5 | 2K174 | 9 | NO | 13 | True |
| 6 | 2K106 | 8 | NO | 14.79 | False |
| 7 | 2K152 | 6 | NO | 13 | True |
Now, let’s drop these duplicate records using the following code:
```python
# Dropping the duplicate records
df_sub.drop_duplicates()
```
As per the above code, the first occurrence of a duplicate record is kept and the later occurrences are deleted. We can reverse this behavior by passing 'last' to the keep argument, as shown:
```python
df_sub.drop_duplicates(keep='last')
```
Using either of the above code blocks, we can drop the duplicate records. However, the change is temporary: the original DataFrame df_sub is not modified. To make the change permanent, we can follow either of these approaches:
```python
# Dropping the duplicate records permanently
df_sub.drop_duplicates(inplace=True)

# Or:
df_sub = df_sub.drop_duplicates().reset_index(drop=True)  # Resetting the index is optional
```
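drop_duplicates() can also compare only selected columns via its subset argument. This is a stricter rule than the full-row comparison used above (rows with the same Student_ID but different grades would also count as duplicates), so use it only when the ID alone should be unique. A standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID': ['2K131', '2K152', '2K152'],
    'St_GRADE': [7.5, 7.0, 6.0],
})

# Considering only Student_ID, the second 2K152 row counts as a duplicate
deduped = df.drop_duplicates(subset=['Student_ID'])
print(len(deduped))  # 2
```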
We can even drop all the instances, i.e., the original record as well as its duplicates. This can be achieved by passing False to the keep argument:
```python
df_sub.drop_duplicates(keep=False, inplace=True)
```
The output of the above code is shown below:

| | Student_ID | St_GRADE | Left_School | St_AGE | is_duplicate |
|---|---|---|---|---|---|
| 1 | 2K131 | 7.5 | NO | 12 | False |
| 2 | 2K152 | 7 | NO | 12 | False |
| 3 | 2K185 | 7 | NO | 13.67 | False |
| 4 | 2K106 | 8 | NO | 14.79 | False |
Identify and Correct the Invalid Age of the Student
We consider an age invalid if its value is less than or equal to zero. This can be checked using the following query:
```python
# Counting the number of invalid student ages
(df_sub.St_AGE <= 0).sum()
```
The result of the above code is zero.
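Had the count been non-zero, one reasonable correction (a hypothetical example, since this dataset contains no such rows) is to blank out the invalid ages and fill them from the valid ones:

```python
import pandas as pd

# A hypothetical age column containing invalid (non-positive) values
ages = pd.Series([12, -1, 13, 0, 14])

# mask() replaces values where the condition is True with NaN
ages = ages.mask(ages <= 0)

# Fill the blanks with the median of the remaining valid ages
ages = ages.fillna(ages.median())
print(ages.tolist())  # [12.0, 13.0, 13.0, 13.0, 14.0]
```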
Now we know there are no invalid ages in the dataset. However, a few of the student ages are in decimal format, so let us convert them to whole numbers:
```python
df_sub.St_AGE = df_sub.St_AGE.astype(int)
```
The above code drops the decimal part and changes all the values in the column St_AGE to the integer datatype, as shown:
| | Student_ID | St_GRADE | Left_School | St_AGE | is_duplicate |
|---|---|---|---|---|---|
| 1 | 2K131 | 7.5 | NO | 12 | False |
| 2 | 2K152 | 7 | NO | 12 | False |
| 3 | 2K185 | 7 | NO | 13 | False |
| 4 | 2K106 | 8 | NO | 14 | False |
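Note that astype(int) truncates toward zero rather than rounding: 14.79 becomes 14, not 15. If rounding to the nearest integer is what you actually want, round first:

```python
import pandas as pd

ages = pd.Series([13.67, 14.79, 13.0])

print(ages.astype(int).tolist())          # [13, 14, 13]  (decimals discarded)
print(ages.round().astype(int).tolist())  # [14, 15, 13]  (nearest integer)
```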
Conclusion
This guide taught you the basic operations used for data cleaning: handling duplicate data, missing data, and invalid data.