Handling Missing Data in Machine Learning Models
There are many ways to handle missing data, from deleting the affected data points to interpolation, each with its own risks.
Jan 29, 2019 • 6 Minute Read
Introduction
Missing data is one of the more annoying problems that come up when working with data sets of any size. Data can be missing for several reasons. Some of the common ones are:
- Data is merged from various sources: One source did not capture some values and another source did not capture others, so the merged data has gaps.
- Data gets richer over time: The data that was collected first, chronologically, may not have the attributes that were added at a later time.
- Data is no longer collected: For ethical or political reasons, certain attributes may stop being collected. For example, a government may decide that data related to religion, race, and ethnicity should not be collected anymore.
- Survey non-response: During a survey, some individuals could not or did not answer all of the questions.
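The first reason in the list above can be sketched in a few lines. The two source tables and their values are made up for illustration; the point is only that an outer merge leaves NaN wherever one source never captured an attribute.

```python
import pandas as pd

# Two hypothetical sources describing overlapping customers (made-up values).
source_a = pd.DataFrame({"customer_id": [1, 2, 3], "age": [20, 50, 35]})
source_b = pd.DataFrame({"customer_id": [2, 3, 4], "salary": [70000, 50000, 35000]})

# An outer merge keeps every customer from both sources. Attributes that a
# source never captured for a customer show up as NaN.
merged = source_a.merge(source_b, on="customer_id", how="outer")
print(merged)

# Count the gaps per column.
missing_per_column = merged.isna().sum()
print(missing_per_column)
```

Customer 1 has no salary and customer 4 has no age, so each of those columns ends up with one gap.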
Terminology Used To Describe the Missing Data
Based on the origin of the missing data, the following terminology is used to describe it:
- Missing At Random (MAR): The chance that a value is missing depends on other attributes that were observed, not on the missing value itself. A common source is the way a survey is designed. For example, consider the following questions in a survey:
a. Do you smoke? Yes, No
b. If yes, how frequently? once a week, once a day, twice a day, more than 2 times in a day
The answer to question b can be given only if the answer to question a is ‘Yes’. Missing values of this kind arise from the dependency of one attribute on another attribute.
- Missing Completely At Random (MCAR): The missingness is unrelated to any value in the data set, observed or not. The data was simply not captured, due to oversight or other incidental reasons. In a survey, a person may take a break while filling in a questionnaire and, after coming back, start from the next page, leaving a few of the questions on the previous page unanswered.
- Missing Not At Random (MNAR): The missingness depends on the value of the data itself. For example, a survey asks people to reveal their 10th-grade marks in Chemistry. People with lower marks may choose not to reveal them, so you would see mostly high marks in the data sample.
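The three categories can be made concrete with a small simulation. This is only a sketch on synthetic data: the column names, the 20% and 50% rates, and the cutoff of 50 marks are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic survey: an always-observed attribute (age) and the marks.
survey = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "marks": rng.normal(60, 15, size=n),
})

# MCAR: every value has the same 20% chance of going missing.
mcar = survey["marks"].mask(rng.random(n) < 0.2)

# MAR: missingness depends on another observed attribute (age), not on marks.
mar = survey["marks"].mask((survey["age"] < 30) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the value itself (low marks are withheld).
mnar = survey["marks"].mask(survey["marks"] < 50)

print("true mean:", survey["marks"].mean())
print("MCAR mean:", mcar.mean())
print("MNAR mean:", mnar.mean())
```

In the MNAR case the surviving values are all 50 or higher, so the observed mean is biased upward. That is the effect described in the Chemistry marks example.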
What to Do with Missing Data
There are two primary ways in which we can handle the missing data.
Deleting the Data
In this method, the user removes from the data set the records or columns that have missing data.
Let’s consider the following data set:
import pandas as pd
df = pd.read_csv('household_data_missing.csv')
print(df)
Output:
Item_Category Gender Age Salary Purchased satisfaction
0 Fitness Male 20 NaN Yes NaN
1 Fitness Female 50 70000.0 No NaN
2 Food Male 35 50000.0 Yes NaN
3 Kitchen Male 22 NaN No NaN
4 Kitchen Female 30 35000.0 Yes NaN
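Since household_data_missing.csv is not included here, the following sketch rebuilds the printed table in memory (an assumption that the file holds exactly these five rows) and counts the missing values in each column before anything is deleted.

```python
import numpy as np
import pandas as pd

# In-memory copy of the table printed above, so the example runs without
# the CSV file.
df = pd.DataFrame({
    "Item_Category": ["Fitness", "Fitness", "Food", "Kitchen", "Kitchen"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Age": [20, 50, 35, 22, 30],
    "Salary": [np.nan, 70000.0, 50000.0, np.nan, 35000.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
    "satisfaction": [np.nan] * 5,
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()
print(missing_count)
print(missing_share)
```

Salary is missing in two of the five rows, and satisfaction is missing everywhere.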
Remove every column in which all values are NA.
print(df.dropna(axis='columns', how='all'))
Output:
Item_Category Gender Age Salary Purchased
0 Fitness Male 20 NaN Yes
1 Fitness Female 50 70000.0 No
2 Food Male 35 50000.0 Yes
3 Kitchen Male 22 NaN No
4 Kitchen Female 30 35000.0 Yes
Retain all rows that have at least five values present.
print(df.dropna(axis='rows', thresh=5))
Output:
Item_Category Gender Age Salary Purchased satisfaction
1 Fitness Female 50 70000.0 No NaN
2 Food Male 35 50000.0 Yes NaN
4 Kitchen Female 30 35000.0 Yes NaN
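The deletion options above can be compared side by side. This sketch again assumes the five-row table from earlier, rebuilt in memory. It also uses the subset parameter of dropna, which the text above does not cover, to restrict the check to a single column.

```python
import numpy as np
import pandas as pd

# In-memory copy of the example table (assumed contents of the CSV).
df = pd.DataFrame({
    "Item_Category": ["Fitness", "Fitness", "Food", "Kitchen", "Kitchen"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Age": [20, 50, 35, 22, 30],
    "Salary": [np.nan, 70000.0, 50000.0, np.nan, 35000.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
    "satisfaction": [np.nan] * 5,
})

# how='all' drops only the column that is entirely empty.
no_empty_cols = df.dropna(axis="columns", how="all")

# thresh=5 keeps rows with at least five non-missing values.
at_least_five = df.dropna(axis="index", thresh=5)

# subset= limits the check to the listed columns (here: Salary only).
has_salary = df.dropna(subset=["Salary"])

print(list(no_empty_cols.columns))
print(list(at_least_five.index))
print(f"rows kept: {len(has_salary)} of {len(df)}")
```

Both row-wise variants discard two of the five records. On a data set this small, that is a substantial loss of information.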
Interpolation
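It is advisable to retain as much of the data as possible rather than delete it. To do so, the user can estimate the unknown values from the available data points, a technique known as interpolation. The pandas interpolate function provides various methods for this. Note that the outputs below appear to come from a variant of the data set in which Salary is 25000 for row 0 and missing for rows 2 and 3; with the data exactly as printed earlier, row 0 has no earlier value to interpolate from and would remain NaN by default.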
print(df.interpolate(method='linear'))
Output:
Item_Category Gender Age Salary Purchased satisfaction
0 Fitness Male 20 25000.000000 Yes NaN
1 Fitness Female 50 70000.000000 No NaN
2 Food Male 35 58333.333333 Yes NaN
3 Kitchen Male 22 46666.666667 No NaN
4 Kitchen Female 30 35000.000000 Yes NaN
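The linear result can be checked by hand. The sketch below works on the Salary column alone, using values inferred from the output above (an assumption: known at rows 0, 1 and 4, missing at rows 2 and 3). It also shows what happens when the first value is missing instead.

```python
import numpy as np
import pandas as pd

# Salary values inferred from the printed output (assumption).
salary = pd.Series([25000.0, 70000.0, np.nan, np.nan, 35000.0], name="Salary")
filled = salary.interpolate(method="linear")
print(filled)

# The same numbers by hand: step evenly from 70000 (row 1) to 35000 (row 4).
step = (35000.0 - 70000.0) / (4 - 1)
by_hand = [70000.0 + step * k for k in (1, 2)]
print(by_hand)

# A gap at the very start has no earlier neighbour, so with the default
# forward direction it is left as NaN.
leading_gap = pd.Series([np.nan, 70000.0, 50000.0, np.nan, 35000.0])
leading_filled = leading_gap.interpolate(method="linear")
print(leading_filled)
```

Linear interpolation treats the rows as equally spaced and draws a straight line between the nearest known neighbours. That is reasonable only when row order carries meaning, which is questionable for unordered household records like these.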
print(df.interpolate(method='quadratic'))
Output:
Item_Category Gender Age Salary Purchased satisfaction
0 Fitness Male 20 25000.000000 Yes NaN
1 Fitness Female 50 70000.000000 No NaN
2 Food Male 35 86666.666667 Yes NaN
3 Kitchen Male 22 75000.000000 No NaN
4 Kitchen Female 30 35000.000000 Yes NaN
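With only three known points, the quadratic estimates printed above match the single parabola that passes through those points. The sketch below reproduces them with NumPy alone. It is a stand-in for illustration, not the routine pandas uses internally, and the known Salary values are inferred from the output (an assumption).

```python
import numpy as np

# Known Salary values and their row positions (inferred from the output).
x_known = np.array([0.0, 1.0, 4.0])
y_known = np.array([25000.0, 70000.0, 35000.0])

# Fit the unique degree-2 polynomial through the three known points.
coeffs = np.polyfit(x_known, y_known, deg=2)

# Evaluate it at the rows where Salary is missing.
estimates = np.polyval(coeffs, [2.0, 3.0])
print(estimates)
```

Both estimates, roughly 86666.67 and 75000, exceed the largest known salary of 70000. Higher-order methods can overshoot the observed range in this way, which is one of the risks of interpolating from very few points.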
Other Methods
pandas interpolate offers several other methods that suit different situations, and further techniques exist outside pandas:
- Spline: Useful when the estimated values may fall outside the range between the known minimum and maximum.
- Kriging: A geostatistical model (not part of pandas interpolate) that uses the correlation among all existing data points to predict the values of the missing data.
- Quadratic: Useful when the values change at an increasing rate.
- Akima: Useful when the aim is a smooth movement from one point to the next.
Conclusion
There are many ways to handle missing data, from deleting the affected data points to interpolation. However, each strategy involves factors and risks that need to be understood before a method is selected. As seen above, the user should make as much use of the available data as possible, but interpolating over sparsely scattered data may lead to overfitting and, in turn, to unpredictable results.