Cleaning Data Using Python
Sep 16, 2019 • 13 Minute Read
Introduction
Cleaning data is a critical component of data science and predictive modeling. Even the best of machine learning algorithms will fail if the data is not clean. In this guide, you will learn about the techniques required to perform the most widely used data cleaning tasks in Python.
Data
In this guide, we will be using a fictitious dataset of loan applicants that contains 600 observations and 12 variables, as described below:
- UID - Unique identifier of an applicant.
- Marital_status - Whether the applicant is married ("Yes") or not ("No").
- Dependents - Number of dependents of the applicant.
- Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").
- Income - Annual income of the applicant (in US dollars).
- Loan_amount - Loan amount (in US dollars) for which the application was submitted.
- Term_months - Tenure of the loan (in months).
- Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
- Age - The applicant's age in years.
- Sex - Whether the applicant is female ("F") or male ("M").
- Purpose - Purpose of applying for the loan.
- approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and the data.
# Import required libraries
import pandas as pd
import numpy as np
# Reading the data
df = pd.read_csv("data_cle.csv")
print(df.shape)
print(df.info())
Output:
(600, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
UID 600 non-null object
Marital_status 600 non-null object
Dependents 593 non-null float64
Is_graduate 594 non-null object
Income 600 non-null int64
Loan_amount 593 non-null float64
Term_months 594 non-null float64
Credit_score 600 non-null object
approval_status 600 non-null int64
Age 598 non-null float64
Sex 600 non-null object
Purpose 600 non-null object
dtypes: float64(4), int64(2), object(6)
memory usage: 56.3+ KB
None
The output above shows 600 observations of 12 variables. Some variables have fewer than 600 non-null records, indicating the presence of missing values.
Dropping Duplicates
One of the most common errors in real-world data is the presence of duplicate records. Such records add no information and must be removed. In our dataset, 'UID' is the unique identifier variable and is used to drop the duplicate records. The first line of code below uses the drop_duplicates() function to find and drop duplicates based on 'UID'. The second line prints the shape of the data: 597 observations of 12 variables, which means three duplicate 'UID' records have been removed.
df = df.drop_duplicates('UID', keep='first')
print(df.shape)
Output:
(597, 12)
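If you would like to inspect the duplicate records before discarding them, pandas' duplicated() method can list them. The check below is an optional step (run it before the drop_duplicates() call above), not part of the original workflow.
# Show every row whose 'UID' appears more than once (keep=False marks all copies)
print(df[df.duplicated('UID', keep=False)])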
Handling Incorrect Records
Let us look at the summary statistics of the data using the line of code below. The output shows that the minimum value of the 'Age' variable is -12, which is impossible and brings us to the next common problem in real-world datasets: the presence of incorrect records.
df.describe()
Output:
| | Dependents | Income | Loan_amount | Term_months | approval_status | Age |
|------- |------------ |-------------- |-------------- |------------- |----------------- |------------ |
| count | 590.000000 | 5.970000e+02 | 5.900000e+02 | 591.000000 | 597.000000 | 595.000000 |
| mean | 0.755932 | 7.073414e+05 | 3.287864e+05 | 366.111675 | 0.681742 | 49.030252 |
| std | 1.029987 | 7.125482e+05 | 7.293830e+05 | 59.970906 | 0.466191 | 15.698842 |
| min | 0.000000 | 3.000000e+04 | 1.860000e+04 | 60.000000 | 0.000000 | -12.000000 |
| 25% | 0.000000 | 3.852000e+05 | 6.150000e+04 | 384.000000 | 0.000000 | 36.500000 |
| 50% | 0.000000 | 5.084000e+05 | 7.650000e+04 | 384.000000 | 1.000000 | 51.000000 |
| 75% | 1.000000 | 7.661000e+05 | 1.346250e+05 | 384.000000 | 1.000000 | 61.000000 |
| max | 6.000000 | 8.444900e+06 | 7.780000e+06 | 504.000000 | 1.000000 | 76.000000 |
There are many techniques for handling such incorrect entries. One approach is to delete these records, but here we will instead replace the negative values with 20, on the assumption that the minimum age of a loan applicant is twenty. The first three lines of code below create a 'for' loop that locates the negative values in the 'Age' variable and replaces them with 20. The fourth line prints the summary statistics for the 'Age' variable, which confirms that the correction has been made.
# Replace negative values in 'Age' with 20
for i in range(df.shape[0]):
    if df.Age.iloc[i] < 0.0:
        df.Age.iloc[i] = 20.0

df['Age'].describe()
Output:
count 595.000000
mean 49.373109
std 14.703877
min 20.000000
25% 36.500000
50% 51.000000
75% 61.000000
max 76.000000
Name: Age, dtype: float64
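The loop above works, but row-wise assignment is slow on larger datasets and relies on chained indexing, which newer versions of pandas warn about. A vectorized sketch of the same correction is shown below; it produces the same result.
# Vectorized equivalent: floor all negative ages at 20 in a single assignment
df.loc[df['Age'] < 0, 'Age'] = 20.0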
Missing Values
Missing value treatment is the most common data cleaning step performed in a data science project. Chaining the isnull() and sum() methods gives the number of missing values in each variable.
df.isnull().sum()
Output:
UID 0
Marital_status 0
Dependents 7
Is_graduate 6
Income 0
Loan_amount 7
Term_months 6
Credit_score 0
approval_status 0
Age 2
Sex 0
Purpose 0
dtype: int64
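It can also be useful to look at missing values as a percentage of the data, which helps you decide between imputation and deletion. The snippet below is an optional check, not a step from the original workflow.
# Percentage of missing values per column
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct)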
The output above shows missing values in several variables, most of which are numerical; the exception is 'Is_graduate', which is categorical. For a categorical variable, it is useful to look at the frequency distribution, produced by the line of code below. The output shows that the majority of the applicants are graduates, identified by the label 'Yes'.
df['Is_graduate'].value_counts()
Output:
Yes 463
No 128
Name: Is_graduate, dtype: int64
We are now ready for missing value imputation. The most widely used technique is to replace missing values with a measure of central tendency: mean or median for numerical variables and mode for categorical ones. In the code below, the first three lines impute the numerical variables 'Loan_amount', 'Term_months', and 'Age' with their means, and the fourth line fills the categorical variable 'Is_graduate' with its most frequent label, 'Yes'. The fifth line deletes any rows that still contain missing values, and the final check confirms that none remain.
# Missing value treatment for continuous variables
df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Missing value treatment for the categorical variable
df['Is_graduate'].fillna('Yes',inplace=True)
# Deleting the missing values in remaining variables
df = df.dropna()
# Checking the missing values again
df.isnull().sum()
Output:
UID 0
Marital_status 0
Dependents 0
Is_graduate 0
Income 0
Loan_amount 0
Term_months 0
Credit_score 0
approval_status 0
Age 0
Sex 0
Purpose 0
dtype: int64
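If you prefer not to hard-code the replacement values, scikit-learn's SimpleImputer can apply the same mean and most-frequent strategies. The sketch below is an alternative to the fillna() calls above (it assumes scikit-learn is installed and uses the column names from this dataset), not an additional step.
# Alternative imputation using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer

num_cols = ['Loan_amount', 'Term_months', 'Age']
df[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

cat_cols = ['Is_graduate']
df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])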
Dropping Irrelevant Columns
Building a machine learning model requires two types of variables: the dependent (or target) variable and the independent (or predictor) variables. The unique identifier, 'UID', belongs to neither type, so we remove it from the data in the first line of code below. The second line prints the shape of the resulting dataframe, which now contains 590 observations of 11 variables.
df = df.drop(['UID'], axis=1)
print(df.shape)
Output:
(590, 11)
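With the identifier removed, a common next step is to separate the dependent variable from the predictors. The sketch below shows the usual pattern; it is not required for the remaining steps in this guide, which keep 'approval_status' inside the dataframe.
# Split the data into predictors (X) and the target (y)
X = df.drop('approval_status', axis=1)
y = df['approval_status']
print(X.shape, y.shape)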
Dummy Encoding
Some of the variables in the dataset, such as 'Sex', contain text labels ('M' and 'F'). For modeling with scikit-learn, all variables should be numeric, so we convert these variables using a technique called dummy encoding. Passing the argument drop_first=True to the get_dummies() function drops one level per variable so that no redundant information is encoded, as done in the first five lines of code below. The sixth line prints information about the data, which shows that it now has 15 variables.
df = pd.get_dummies(df, columns=['Sex'], drop_first=True, prefix='S')
df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 1 to 599
Data columns (total 15 columns):
Dependents 590 non-null float64
Income 590 non-null int64
Loan_amount 590 non-null float64
Term_months 590 non-null float64
approval_status 590 non-null int64
Age 590 non-null float64
S_F 590 non-null uint8
Pur_Education 590 non-null uint8
Pur_Furniture 590 non-null uint8
Pur_Personal 590 non-null uint8
Pur_Travel 590 non-null uint8
Pur_Wedding 590 non-null uint8
M_Yes 590 non-null uint8
Ed_Yes 590 non-null uint8
CSc_Satisfactory 590 non-null uint8
dtypes: category(1), float64(4), int64(2), uint8(8)
memory usage: 57.5 KB
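The five get_dummies() calls above can also be written as a single call by passing all the categorical columns at once. The sketch below produces the same encoding, except that pandas derives the column prefixes from the original column names rather than the short prefixes used above; it is an alternative, not an additional step.
# One-shot dummy encoding of all categorical columns
cat_cols = ['Sex', 'Purpose', 'Marital_status', 'Is_graduate', 'Credit_score']
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)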
Outlier Treatment
One of the biggest problems in machine learning is the presence of outliers: extreme values that differ markedly from the other data points. For numerical variables, a highly skewed distribution is a useful signal that outliers are present. The two lines of code below print the skewness value, along with the summary statistics, for the 'Income' variable.
print(df['Income'].skew())
df['Income'].describe()
Output:
5.31688331023
count 5.900000e+02
mean 7.112837e+05
std 7.157367e+05
min 3.000000e+04
25% 3.861750e+05
50% 5.099500e+05
75% 7.726500e+05
max 8.444900e+06
Name: Income, dtype: float64
The output above shows that the variable 'Income' has a right-skewed distribution with a skewness value of 5.3. Ideally, the skewness value should be between -1 and 1. There are many techniques for handling outliers, one of which is quantile-based flooring and capping: flooring the lower values and capping the higher ones.
The two lines of code below print the 10th and 90th percentiles of 'Income', which will be used as the floor and the cap.
print(df['Income'].quantile(0.10))
print(df['Income'].quantile(0.90))
Output:
303010.0
1250010.0
Now we apply the flooring and capping using np.where(), as shown in the first two lines of code below. The third line prints the skewness again, which has dropped to around 1, a significant improvement.
df["Income"] = np.where(df["Income"] <303010.0, 303010.0,df['Income'])
df["Income"] = np.where(df["Income"] >1250010.0, 1250010.0,df['Income'])
print(df['Income'].skew())
Output:
1.00971894475
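The same flooring and capping can be written more compactly with pandas' clip() method, and a log transform is another common option when the distribution remains skewed. The sketch below is an alternative to the np.where() calls above; apply it in their place, since the percentile bounds should be computed from the uncapped column.
# Floor at the 10th percentile and cap at the 90th percentile using clip()
lower, upper = df['Income'].quantile(0.10), df['Income'].quantile(0.90)
df['Income'] = df['Income'].clip(lower=lower, upper=upper)

# Optional: a log transform to further reduce right skew
# df['Income_log'] = np.log1p(df['Income'])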
Conclusion
In this guide, you have learned about the most common problems encountered in real-world data and the techniques for cleaning such messy data. How these techniques are applied varies with the data, the problem statement, and the machine learning algorithm selected for building the model.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
- Preparing Data for Modeling with scikit-learn
- Interpreting Data Using Descriptive Statistics with Python