Cleaning Data Using Python
Sep 16, 2019 • 13 Minute Read
Introduction
Cleaning data is a critical component of data science and predictive modeling. Even the best of machine learning algorithms will fail if the data is not clean. In this guide, you will learn about the techniques required to perform the most widely used data cleaning tasks in Python.
Data
In this guide, we will be using a fictitious dataset of loan applicants that contains 600 observations and 12 variables, as described below:
- UID - Unique identifier of an applicant.
- Marital_status - Whether the applicant is married ("Yes") or not ("No").
- Dependents - Number of dependents of the applicant.
- Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").
- Income - Annual income of the applicant (in US dollars).
- Loan_amount - Loan amount (in US dollars) for which the application was submitted.
- Term_months - Tenure of the loan (in months).
- Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
- Age - The applicant's age in years.
- Sex - Whether the applicant is female ("F") or male ("M").
- Purpose - Purpose of applying for the loan.
- approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and the data.
# Import required libraries
import pandas as pd
import numpy as np
# Reading the data
df = pd.read_csv("data_cle.csv")
print(df.shape)
print(df.info())
Output:
(600, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
UID 600 non-null object
Marital_status 600 non-null object
Dependents 593 non-null float64
Is_graduate 594 non-null object
Income 600 non-null int64
Loan_amount 593 non-null float64
Term_months 594 non-null float64
Credit_score 600 non-null object
approval_status 600 non-null int64
Age 598 non-null float64
Sex 600 non-null object
Purpose 600 non-null object
dtypes: float64(4), int64(2), object(6)
memory usage: 56.3+ KB
None
The output above shows 600 observations of 12 variables. Some variables have fewer than 600 non-null records, indicating the presence of missing values.
Dropping Duplicates
One of the most common errors in real-world data is the presence of duplicate records. Such records add no information and must be removed. In our dataset, 'UID' is the unique identifier variable and is used to drop the duplicate records. The first line of code below uses the drop_duplicates() function to find and drop duplicates based on 'UID'. The second line prints the shape of the data: 597 observations of 12 variables, which means three duplicate 'UID' records have been removed.
df = df.drop_duplicates('UID', keep='first')
print(df.shape)
Output:
(597, 12)
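If you would like to inspect the duplicate records before discarding them, pandas' duplicated() method can list them. The check below is an optional step (run it before the drop_duplicates() call above), not part of the original workflow.
# Show every row whose 'UID' appears more than once (keep=False marks all copies)
print(df[df.duplicated('UID', keep=False)])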
Handling Incorrect Records
Let us look at the summary statistics of the data using the line of code below. The output shows that the minimum value of the 'Age' variable is -12, which is impossible and brings us to the next common problem in real-world datasets: the presence of incorrect records.
df.describe()
Output:
| | Dependents | Income | Loan_amount | Term_months | approval_status | Age |
|------- |------------ |-------------- |-------------- |------------- |----------------- |------------ |
| count | 590.000000 | 5.970000e+02 | 5.900000e+02 | 591.000000 | 597.000000 | 595.000000 |
| mean | 0.755932 | 7.073414e+05 | 3.287864e+05 | 366.111675 | 0.681742 | 49.030252 |
| std | 1.029987 | 7.125482e+05 | 7.293830e+05 | 59.970906 | 0.466191 | 15.698842 |
| min | 0.000000 | 3.000000e+04 | 1.860000e+04 | 60.000000 | 0.000000 | -12.000000 |
| 25% | 0.000000 | 3.852000e+05 | 6.150000e+04 | 384.000000 | 0.000000 | 36.500000 |
| 50% | 0.000000 | 5.084000e+05 | 7.650000e+04 | 384.000000 | 1.000000 | 51.000000 |
| 75% | 1.000000 | 7.661000e+05 | 1.346250e+05 | 384.000000 | 1.000000 | 61.000000 |
| max | 6.000000 | 8.444900e+06 | 7.780000e+06 | 504.000000 | 1.000000 | 76.000000 |
There are many techniques for handling such incorrect entries. One approach is to delete these records, but here we will instead replace the negative values with 20, on the assumption that the minimum age of a loan applicant is twenty. The first three lines of code below create a 'for' loop that locates the negative values in the 'Age' variable and replaces them with 20. The fourth line prints the summary statistics for the 'Age' variable, which confirms that the correction has been made.
# Replace negative values in 'Age' with 20
for i in range(df.shape[0]):
    if df.Age.iloc[i] < 0.0:
        df.Age.iloc[i] = 20.0

df['Age'].describe()
Output:
count 595.000000
mean 49.373109
std 14.703877
min 20.000000
25% 36.500000
50% 51.000000
75% 61.000000
max 76.000000
Name: Age, dtype: float64
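The loop above works, but row-wise assignment is slow on larger datasets and relies on chained indexing, which newer versions of pandas warn about. A vectorized sketch of the same correction is shown below; it produces the same result.
# Vectorized equivalent: floor all negative ages at 20 in a single assignment
df.loc[df['Age'] < 0, 'Age'] = 20.0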
Missing Values
Missing value treatment is the most common data cleaning step performed in a data science project. Chaining the isnull() and sum() methods gives the number of missing values in each variable.
df.isnull().sum()
Output:
UID 0
Marital_status 0
Dependents 7
Is_graduate 6
Income 0
Loan_amount 7
Term_months 6
Credit_score 0
approval_status 0
Age 2
Sex 0
Purpose 0
dtype: int64
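It can also be useful to look at missing values as a percentage of the data, which helps you decide between imputation and deletion. The snippet below is an optional check, not a step from the original workflow.
# Percentage of missing values per column
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct)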
The output above shows missing values in several variables, most of which are numerical; the exception is 'Is_graduate', which is categorical. For a categorical variable, it is useful to look at the frequency distribution, produced by the line of code below. The output shows that the majority of the applicants are graduates, identified by the label 'Yes'.
df['Is_graduate'].value_counts()
Output:
Yes 463
No 128
Name: Is_graduate, dtype: int64
We are now ready for missing value imputation. The most widely used technique is to replace missing values with a measure of central tendency: mean or median for numerical variables and mode for categorical ones. In the code below, the first three lines impute the numerical variables 'Loan_amount', 'Term_months', and 'Age' with their means, and the fourth line fills the categorical variable 'Is_graduate' with its most frequent label, 'Yes'. The fifth line deletes any rows that still contain missing values, and the final check confirms that none remain.
# Missing value treatment for continuous variables
df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Missing value treatment for the categorical variable
df['Is_graduate'].fillna('Yes',inplace=True)
# Deleting the missing values in remaining variables
df = df.dropna()
# Checking the missing values again
df.isnull().sum()
Output:
UID 0
Marital_status 0
Dependents 0
Is_graduate 0
Income 0
Loan_amount 0
Term_months 0
Credit_score 0
approval_status 0
Age 0
Sex 0
Purpose 0
dtype: int64
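If you prefer not to hard-code the replacement values, scikit-learn's SimpleImputer can apply the same mean and most-frequent strategies. The sketch below is an alternative to the fillna() calls above (it assumes scikit-learn is installed and uses the column names from this dataset), not an additional step.
# Alternative imputation using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer

num_cols = ['Loan_amount', 'Term_months', 'Age']
df[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

cat_cols = ['Is_graduate']
df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])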
Dropping Irrelevant Columns
Building a machine learning model requires two types of variables: the dependent (or target) variable and the independent (or predictor) variables. The unique identifier, 'UID', belongs to neither type, so we remove it from the data in the first line of code below. The second line prints the shape of the resulting dataframe, which now contains 590 observations of 11 variables.
df = df.drop(['UID'], axis=1)
print(df.shape)
Output:
(590, 11)
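With the identifier removed, a common next step is to separate the dependent variable from the predictors. The sketch below shows the usual pattern; it is not required for the remaining steps in this guide, which keep 'approval_status' inside the dataframe.
# Split the data into predictors (X) and the target (y)
X = df.drop('approval_status', axis=1)
y = df['approval_status']
print(X.shape, y.shape)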
Dummy Encoding
Some of the variables in the dataset, such as 'Sex', contain text labels ('M' and 'F'). For modeling with scikit-learn, all variables should be numeric, so we convert these variables using a technique called dummy encoding. Passing the argument drop_first=True to the get_dummies() function drops one level per variable so that no redundant information is encoded, as done in the first five lines of code below. The sixth line prints information about the data, which shows that it now has 15 variables.
df = pd.get_dummies(df, columns=['Sex'], drop_first=True, prefix='S')
df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 1 to 599
Data columns (total 15 columns):
Dependents 590 non-null float64
Income 590 non-null int64
Loan_amount 590 non-null float64
Term_months 590 non-null float64
approval_status 590 non-null int64
Age 590 non-null float64
S_F 590 non-null uint8
Pur_Education 590 non-null uint8
Pur_Furniture 590 non-null uint8
Pur_Personal 590 non-null uint8
Pur_Travel 590 non-null uint8
Pur_Wedding 590 non-null uint8
M_Yes 590 non-null uint8
Ed_Yes 590 non-null uint8
CSc_Satisfactory 590 non-null uint8
dtypes: category(1), float64(4), int64(2), uint8(8)
memory usage: 57.5 KB
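The five get_dummies() calls above can also be written as a single call by passing all the categorical columns at once. The sketch below produces the same encoding, except that pandas derives the column prefixes from the original column names rather than the short prefixes used above; it is an alternative, not an additional step.
# One-shot dummy encoding of all categorical columns
cat_cols = ['Sex', 'Purpose', 'Marital_status', 'Is_graduate', 'Credit_score']
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)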
Outlier Treatment
One of the biggest problems in machine learning is the presence of outliers: extreme values that differ markedly from the other data points. For numerical variables, a highly skewed distribution is a useful signal that outliers are present. The two lines of code below print the skewness value, along with the summary statistics, for the 'Income' variable.
print(df['Income'].skew())
df['Income'].describe()
Output:
5.31688331023
count 5.900000e+02
mean 7.112837e+05
std 7.157367e+05
min 3.000000e+04
25% 3.861750e+05
50% 5.099500e+05
75% 7.726500e+05
max 8.444900e+06
Name: Income, dtype: float64
The output above shows that the variable 'Income' has a right-skewed distribution with a skewness value of 5.3. Ideally, the skewness value should be between -1 and 1. There are many techniques for handling outliers, one of which is quantile-based flooring and capping: flooring the lower values and capping the higher ones.
The two lines of code below print the 10th and 90th percentiles of 'Income', which will be used as the floor and the cap.
print(df['Income'].quantile(0.10))
print(df['Income'].quantile(0.90))
Output:
303010.0
1250010.0
Now we apply the flooring and capping using np.where(), as shown in the first two lines of code below. The third line prints the skewness again, which has dropped to around 1, a significant improvement.
df["Income"] = np.where(df["Income"] <303010.0, 303010.0,df['Income'])
df["Income"] = np.where(df["Income"] >1250010.0, 1250010.0,df['Income'])
print(df['Income'].skew())
Output:
1.00971894475
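The same flooring and capping can be written more compactly with pandas' clip() method, and a log transform is another common option when the distribution remains skewed. The sketch below is an alternative to the np.where() calls above; apply it in their place, since the percentile bounds should be computed from the uncapped column.
# Floor at the 10th percentile and cap at the 90th percentile using clip()
lower, upper = df['Income'].quantile(0.10), df['Income'].quantile(0.90)
df['Income'] = df['Income'].clip(lower=lower, upper=upper)

# Optional: a log transform to further reduce right skew
# df['Income_log'] = np.log1p(df['Income'])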
Conclusion
In this guide, you have learned about the most common problems encountered in real-world data and the techniques for cleaning such messy data. How these techniques are applied varies with the data, the problem statement, and the machine learning algorithm selected for building the model.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
- Preparing Data for Modeling with scikit-learn
- Interpreting Data Using Descriptive Statistics with Python