Preparing Data for Machine Learning
Oct 28, 2019 • 14 Minute Read
Introduction
It's a well-known saying that the quality of output depends on the quality of input. A machine learning model is no different. The accuracy of a machine learning model's predictions depends on the quality and quantity of the data fed to it. Though we often cannot control the amount of data, we certainly can control its quality. It's not uncommon in typical machine learning projects for teams to spend 50%-60% of their time preparing data.
In this article, you will learn different techniques for using some of the standard Python libraries, such as pandas and numpy, to convert raw data to quality data.
Data Preparation Basics
Data preparation in a machine learning project can be broadly subdivided into two major categories.
1. Data wrangling. Also known as data munging, this is the process of normalizing data, identifying missing values, and cleaning the data, either by removing records with missing values or by imputing them using basic statistical operations like the mean or median.
2. Feature engineering. In machine learning terminology, a column of data is often called a "feature." In most cases, we may need to combine two or more features into a single feature or derive a new feature based on the existing feature set. For example, if the data has "employee hire date" as one of the features, we can derive the employee's tenure with the company by subtracting the hire date from the current date (a quick sketch follows).
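For instance, a quick pandas sketch of this kind of derivation (the employee data below is hypothetical and not part of our sample dataset) could look like this:
import pandas as pd

# Hypothetical employee data used only to illustrate deriving a feature
employees = pd.DataFrame({
    'employee_id': [101, 102],
    'hire_date': pd.to_datetime(['2015-03-01', '2018-07-15'])
})

# Derive a new feature: tenure in days, from today's date and the hire date
employees['tenure_days'] = (pd.Timestamp.today() - employees['hire_date']).dt.days
print(employees)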
In this article, we are going to use a hypothetical store that has branches all over the U.S. The sample data includes columns for total inventory items and the number of items that have been sold by a specific branch on a specific date. This can be viewed as a small sample of a huge data set.
Let's import pandas and print the data set.
import pandas as pd

# Load the sample inventory data
df = pd.read_csv('Downloads/inventory.csv')
print(df)
Date Amount SKU dept # Sold branch id branch loc Tot Inv
0 9/10/2019 3231.00 232 100 125 1123 tx 225
1 9/10/2019 4455.00 345 100 343 1123 tx 400
2 9/9/2019 322.90 231 100 5 2343 il 25
3 9/16/2019 33.88 342 100 1 2341 ma 10
4 9/17/2019 34.44 456 300 2 3323 ma 25
5 9/14/2019 NaN 238 300 44 4565 ma 100
6 9/15/2019 181.90 678 300 23 1123 tx 50
7 9/13/2019 3434.77 567 300 771 2343 il 1000
8 9/9/2019 22.11 453 400 2 2343 il 5
9 9/16/2019 23.10 560 400 3 1123 ca 5
10 9/10/2019 33.56 789 200 34 3456 ca 50
11 9/9/2019 67.88 679 200 45 6655 ca 50
12 9/16/2019 23.33 134 200 2 3438 ri 5
13 9/14/2019 34.00 567 300 33 3356 ny 50
14 9/14/2019 6777.77 667 700 7788 6678 ny 10000
15 9/15/2019 233.33 889 700 233 2234 tx 500
16 9/13/2019 989.99 898 700 213 2245 sc 500
17 9/15/2019 9899.88 901 900 234 3344 nc 500
18 9/10/2019 NaN 998 900 33 2233 fl 100
Data Wrangling
Missing Data
Methods for dealing with missing data include deleting, imputing, or predicting.
1. Delete. One of the easiest ways to address null values in your data is simply to drop them. This is the preferred approach if the null values are relatively small in number; if not, we might lose meaningful information by dropping them. Pandas has a dropna() function that drops all rows containing null values. In our sample data, there are two rows with null values.
df = df.dropna()  # drop rows with null values and keep the result
print(df)
Once we execute the above statement, you can see the resulting data with the null rows removed.
Date Amount SKU dept # Sold branch id branch loc Tot Inv
0 9/10/2019 3231.00 232 100 125 1123 tx 225
1 9/10/2019 4455.00 345 100 343 1123 tx 400
2 9/9/2019 322.90 231 100 5 2343 il 25
3 9/16/2019 33.88 342 100 1 2341 ma 10
4 9/17/2019 34.44 456 300 2 3323 ma 25
6 9/15/2019 181.90 678 300 23 1123 tx 50
7 9/13/2019 3434.77 567 300 771 2343 il 1000
8 9/9/2019 22.11 453 400 2 2343 il 5
9 9/16/2019 23.10 560 400 3 1123 ca 5
10 9/10/2019 33.56 789 200 34 3456 ca 50
11 9/9/2019 67.88 679 200 45 6655 ca 50
12 9/16/2019 23.33 134 200 2 3438 ri 5
13 9/14/2019 34.00 567 300 33 3356 ny 50
14 9/14/2019 6777.77 667 700 7788 6678 ny 10000
15 9/15/2019 233.33 889 700 233 2234 tx 500
16 9/13/2019 989.99 898 700 213 2245 sc 500
17 9/15/2019 9899.88 901 900 234 3344 nc 500
2. Impute. There are cases where we cannot afford to drop existing data, especially if the sample size is relatively small or the ratio of null values is relatively high. In these cases, we need to impute the missing data, and different strategies are available for that. Some of the commonly used approaches for continuous data are the mean, median, or mode of the feature. For categorical data, the mode is usually the preferred approach. Pandas provides the fillna() method to accomplish this, as sketched below.
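For illustration (in this walkthrough we dropped the null rows instead), a minimal imputation sketch on the original data could look like this:
# Fill missing 'Amount' values with the mean of the column
df['Amount'] = df['Amount'].fillna(df['Amount'].mean())

# For a categorical column, the mode is a common choice
df['branch loc'] = df['branch loc'].fillna(df['branch loc'].mode()[0])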
3. Predict. This is the scenario where we cannot afford to impute a wrong value. Instead of filling in a simple statistic, we predict the missing values using machine learning algorithms: a regression model for continuous data and a classification model for categorical data. A rough sketch follows.
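The sketch below only illustrates the idea, assuming scikit-learn is available, that df still contains the rows with missing 'Amount' values, and that the sold-quantity and inventory columns are named '# Sold' and 'Tot Inv' (the exact names depend on the CSV):
from sklearn.linear_model import LinearRegression

# Split rows with a known and an unknown 'Amount'
known = df[df['Amount'].notna()]
unknown = df[df['Amount'].isna()]

# Use a couple of numeric columns as predictors (a simplification for illustration)
features = ['# Sold', 'Tot Inv']
model = LinearRegression()
model.fit(known[features], known['Amount'])

# Predict the missing values and fill them in
df.loc[df['Amount'].isna(), 'Amount'] = model.predict(unknown[features])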
Detecting Outliers
While preparing the data, we must look out for extreme values. Some of these values may be genuine cases, but some could be erroneous. The presence of outliers would significantly affect the modeling process and hence the prediction accuracy.
For example, in the data above you can see that observation (row) 17 has an extreme 'Amount' value. Compared to the rest of the data, there is a high possibility that this is a data-entry error.
The z-score of an observation is a common way to detect outliers. To calculate the z-score of the 'Amount' feature, we will use the following formula:
z-score = (Amount - Mean of Amount) / (Standard Deviation of Amount)
We can set a specific threshold (commonly 2.0 or 2.5 standard deviations), and once a z-score exceeds this value, we can treat the observation as an outlier. Let's compute the z-score of the 'Amount' feature and plot a graph.
df['Amount_zscore'] = (df['Amount'] - df['Amount'].mean())/df['Amount'].std()
print(df['Amount_zscore'])
0 0.508907
1 0.930321
2 -0.492328
3 -0.591836
4 -0.591643
6 -0.540874
7 0.579064
8 -0.595888
9 -0.595547
10 -0.591946
11 -0.580130
12 -0.595468
13 -0.591794
14 1.730032
15 -0.523167
16 -0.262655
17 2.804950
df['Amount_zscore'].plot.kde()
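Once the z-scores are computed, a short sketch of actually removing the outliers (using a threshold of 2.5) could look like this:
# Keep only the rows whose z-score is within the chosen threshold
threshold = 2.5
df = df[df['Amount_zscore'].abs() <= threshold]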
This is one technique used to detect and eliminate outliers. There are other techniques, like DBSCAN and Isolation Forest, that can be used depending on the data, but explaining them is beyond the scope of this article.
Normalization
When you have data with multiple features, each measured in a different unit or on a different scale, features with large ranges can dominate those with small ranges. Because of this, it's important to convert all features to the same standard scale. This technique is called normalization, or feature scaling.
For example, in our data, inventory quantities range from 1 to 1000, whereas cost ranges from 1 to 10000.
Min-Max scaling is a commonly used technique for normalizing the data. The formula is:
Normalized Feature = (Feature Value - Min. Feature Value) / (Max. Feature Value - Min. Feature Value)
It's important to apply the outlier technique mentioned above before normalizing the data. Otherwise, you run the risk of squeezing the normal values into a small interval.
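As a quick sketch, applying Min-Max scaling to the 'Amount' column in pandas could look like this:
# Min-Max scale the 'Amount' column to the range [0, 1]
amount_min = df['Amount'].min()
amount_max = df['Amount'].max()
df['Amount_scaled'] = (df['Amount'] - amount_min) / (amount_max - amount_min)
print(df['Amount_scaled'])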
Encoding Data
Data is not always numerical. You may have, for example, text data that is categorical. Referring to our dataset, though most of the features are numerical, "branch loc" refers to the state in which a specific branch is located, and it contains text data. As part of preparing the data, it's important to convert this feature to numerical data. There are many ways you can do this. We are going to use a "Label Encoding" technique.
df["branch loc"] = df["branch loc"].astype('category')
df["branch_loc_cat"] = df["branch loc"].cat.codes
print (df[["branch loc", "branch_loc_cat"]])
branch loc branch_loc_cat
0 tx 7
1 tx 7
2 il 1
3 ma 2
4 ma 2
6 tx 7
7 il 1
8 il 1
9 ca 0
10 ca 0
11 ca 0
12 ri 5
13 ny 4
14 ny 4
15 tx 7
16 sc 6
17 nc 3
You can see that for each state, a numerical value has been assigned. We first converted this feature to a categorical type before applying categorical codes to it.
Feature Engineering
Feature engineering requires both domain knowledge and technical knowledge. Having too few features in the data may result in a poorly performing model, and too many features may result in an overly complex model. When there are too many features in the training data, it's highly possible that the model will overfit the data, that is, perform accurately on the training data but poorly on new, unseen test data. It's important to select an optimal number of features that help us design a better-performing model. Selecting an optimal set of features is a two-step process:
- Removing unwanted features from the feature list, also called "feature selection."
- Extracting new features from the existing set of features, also called "feature extraction."
Dimensionality Reduction
Having too many features in the data increases the complexity of the model, the training time, and the computation cost, and it may decrease prediction accuracy because of too many variables. So it is advisable to reduce the features to the smallest set that still yields optimal prediction accuracy. One technique for doing this is dimensionality reduction. It is achieved in two different ways:
- Identifying the features that are relevant to the training process and eliminating those that are redundant or add little meaning to the data.
- Combining multiple features into a new set of features without losing the information conveyed by the originals.
There are many techniques used to accomplish this. Some of the most common are listed below; a brief PCA sketch follows the list.
- Principal Component Analysis (PCA)
- Random Forest
- Low Variance Filter
- High Correlation Filter
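As an example, a minimal PCA sketch using scikit-learn (assumed to be installed; the numeric column names are taken from our sample data and may differ in your CSV) might reduce the numeric features to two components:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns before projecting them
numeric_cols = ['Amount', '# Sold', 'Tot Inv']
X = StandardScaler().fit_transform(df[numeric_cols])

# Project the features onto two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)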
Imbalanced Data
For business problems like credit card fraud detection, spam filtering, or medical diagnosis, the cases we are trying to detect (fraudulent transactions, spam emails, positive diagnoses) could make up less than 1% of the sample. In cases like this, we need to be careful not to reject this rare but genuine data as noise or outliers. If we use accuracy as the performance metric, a model that labels every credit card transaction as valid will still score about 99% accuracy, but businesses are more concerned about the 1% of fraudulent transactions. So accuracy is not the best performance metric in this case, and we may need to rely on other metrics like precision, recall (also called sensitivity), and specificity. All of these can be derived once we have tabulated the confusion matrix.
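As a small sketch of deriving these metrics with scikit-learn (the labels below are hypothetical, where 1 marks a fraudulent transaction):
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = fraudulent transaction, 0 = valid transaction
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))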
Some of the commonly used techniques for handling imbalanced data include:
- Resampling by undersampling the majority class: When we have far more positive data (valid credit card transactions/valid emails) in our sample dataset than negative data (fraudulent transactions/spam emails), we can remove some of the positive observations. By reducing the amount of positive data, we effectively increase the proportion of negative data in the sample.
- Resampling by oversampling the minority class: In this technique, we add more copies of the negative data (fraudulent transactions/spam emails) to our sample dataset. We need to be cautious not to run the risk of overfitting the data.
- SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) generates new, synthetic minority-class observations that can be added to the training data (a brief sketch follows this list).
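A minimal SMOTE sketch, assuming the imbalanced-learn library is installed and using a synthetically generated imbalanced dataset in place of real transaction data:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Generate a hypothetical imbalanced dataset (roughly 95% majority class)
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print("Before SMOTE:", Counter(y))

# Oversample the minority class with synthetic observations
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))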
Conclusion
Though this guide addresses the important aspects of preparing data, it is just the tip of the iceberg. Many other approaches and techniques are available depending on the type of data, and the Python ecosystem offers multiple libraries that are well equipped to handle them. Data preparation is also an iterative process, and you may need to go through some of these steps more than once to fine-tune your data and build a better-predicting model.