Handling Categorical Data in Machine Learning Models
Discover what categorical data is and why it is hard for computers to process. Learn about limited value sets and processing challenges in this brief explanation. | Pluralsight
Feb 20, 2019 • 7 Minute Read
Introduction
Categorical data is data that takes a limited number of possible values. The values need not be numerical; they can be textual. Machine learning models are mathematical models that need numbers to work with, which is one of the primary reasons we pre-process categorical data before feeding it to a model.
Let's consider the following data set:
import pandas as pd
df = pd.read_csv('household_data.txt')
print(df)
Output
Item_Category Gender Age Salary Purchased
0 Fitness Male 20 30000 Yes
1 Fitness Female 50 70000 No
2 Food Male 35 50000 Yes
3 Kitchen Male 22 40000 No
4 Kitchen Female 30 35000 Yes
Intuitively, you can see that Item_Category (Fitness, Food, Kitchen), Gender (Male, Female), and Purchased (Yes, No) are the categorical variables since there is only a limited set of values that these can take.
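The same check can be sketched in code. The example below assumes the five rows shown above, rebuilt inline because household_data.txt is not included with the guide, and uses a simple heuristic (non-numeric columns) that is an illustration rather than a general rule:

```python
import pandas as pd

# Assumption: the guide's household_data.txt is not available here,
# so the same five rows are rebuilt inline.
df = pd.DataFrame({
    "Item_Category": ["Fitness", "Fitness", "Food", "Kitchen", "Kitchen"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Age": [20, 50, 35, 22, 30],
    "Salary": [30000, 70000, 50000, 40000, 35000],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Heuristic: treat every non-numeric column as a categorical candidate.
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

# List the distinct values each candidate column takes.
for column in categorical_columns:
    print(column, sorted(df[column].unique()))
```

Numeric columns can be categorical too (for example, a numeric store code), so a heuristic like this is only a starting point.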
In the rest of this guide, we will see how we can use the Python scikit-learn library to handle categorical data. Scikit-learn is a machine learning toolkit that provides tools for different aspects of machine learning, e.g. classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
There is a subtle difference in how categorical data is handled for the independent and dependent variables. We will learn more about this later in the guide. First, we need to split our data set into the matrix of independent features (X) and the dependent vector (y).
We'll create the independent features matrix (X) from every column except the last:
X = df.iloc[:, :-1].values
print(X)
Output
['Fitness' 'Male' 20 30000]
['Fitness' 'Female' 50 70000]
['Food' 'Male' 35 50000]
['Kitchen' 'Male' 22 40000]
['Kitchen' 'Female' 30 35000]
Then, we'll take the last column of the data set as the dependent vector (y):
y = df.iloc[:, -1].values
print(y)
Output
'Yes' 'No' 'Yes' 'No' 'Yes'
Encoding the Categorical Data for Independent Features Matrix X
Next, we’re going to encode the categorical data for Item_Category and Gender so that they’re changed to numbers which can then be fed to the machine learning models.
Consider the following code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('household_data.txt')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,1] = labelencoder_X.fit_transform(X[:,1])
print(X)
Output
[0 1 20 30000]
[0 0 50 70000]
[1 1 35 50000]
[2 1 22 40000]
[2 0 30 35000]
In the above code, we have used the LabelEncoder class from sklearn.preprocessing to transform the labels for Item_Category and Gender to numbers. LabelEncoder assigns codes in sorted label order, so for Item_Category, 'Fitness' is assigned 0, 'Food' is assigned 1, and 'Kitchen' is assigned 2. Similarly, for Gender, 'Female' is assigned 0 and 'Male' is assigned 1. This leads us to another challenge. Mathematical models treat the codes as ordinary numbers, and some numbers are greater than others, so a model can read an order or magnitude into the codes that the categories do not have, which can lead to inaccurate results.
Consider Gender: in the above output, 'Male' is assigned the value 1 while 'Female' is 0. In the calculations that follow, 'Male' will carry more weight than 'Female'. This does not make sense, because Gender is a category and both values need to be treated equally by the model to predict accurate results.
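To make the problem concrete, here is a minimal sketch that inspects the mapping LabelEncoder learns for Item_Category. The list of values is copied from the data set above; the variable names are illustrative only:

```python
from sklearn.preprocessing import LabelEncoder

# Item_Category values copied from the data set above.
item_categories = ["Fitness", "Fitness", "Food", "Kitchen", "Kitchen"]

encoder = LabelEncoder()
codes = encoder.fit_transform(item_categories)

# classes_ holds the sorted labels; each label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Fitness': 0, 'Food': 1, 'Kitchen': 2}

# Arithmetic on the codes implies a distance the categories do not have:
# 'Kitchen' appears twice as far from 'Fitness' as 'Food' is.
gap_food = mapping["Food"] - mapping["Fitness"]
gap_kitchen = mapping["Kitchen"] - mapping["Fitness"]
print(gap_food, gap_kitchen)  # 1 2
```

Note that the guide's code reuses one encoder for both columns, so after the second fit_transform call only the Gender mapping remains in classes_. A separate encoder per column keeps each mapping available for later decoding.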
One-Hot Encoding
The solution to this problem is to introduce dummy columns. For each distinct value of a categorical variable, a new column is added. So, if the Item_Category value of a row is Fitness, that row gets 1 in the Fitness column and 0 in the Food and Kitchen columns.
Item_Category |
---|
Fitness |
Food |
Kitchen |
is converted to
Fitness | Food | Kitchen |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
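The conversion in the table can be reproduced with a short sketch. It uses pandas' get_dummies rather than the scikit-learn class this guide relies on, purely to illustrate the dummy-column idea on the three values shown:

```python
import pandas as pd

# The three Item_Category values from the table above.
items = pd.DataFrame({"Item_Category": ["Fitness", "Food", "Kitchen"]})

# One new 0/1 column per distinct category value.
dummies = pd.get_dummies(items["Item_Category"], dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column that matches its original category.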
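OneHotEncoder is the class in the scikit-learn preprocessing module that helps us achieve this with ease. Note that the categorical_features argument used below was removed in scikit-learn 0.22, so this code runs as written only on older versions. Consider the following code block: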
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
df = pd.read_csv('household_data.txt')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,1] = labelencoder_X.fit_transform(X[:,1])
onehotencoder = OneHotEncoder(categorical_features=[0,1])
X = onehotencoder.fit_transform(X).toarray()
print(X)
Output
[1.0e+00 0.0e+00 0.0e+00 0.0e+00 1.0e+00 2.0e+01 3.0e+04]
[1.0e+00 0.0e+00 0.0e+00 1.0e+00 0.0e+00 5.0e+01 7.0e+04]
[0.0e+00 1.0e+00 0.0e+00 0.0e+00 1.0e+00 3.5e+01 5.0e+04]
[0.0e+00 0.0e+00 1.0e+00 0.0e+00 1.0e+00 2.2e+01 4.0e+04]
[0.0e+00 0.0e+00 1.0e+00 1.0e+00 0.0e+00 3.0e+01 3.5e+04]
You may notice that the number of columns has increased. The Item_Category column is split into three columns and the Gender column into two, so X grows from four columns to seven. Also notice that, after OneHotEncoder is applied, the values in the resulting NumPy array are printed in scientific notation, because every value is now stored as a floating-point number.
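On scikit-learn 0.22 and later, the categorical_features argument no longer exists. One possible equivalent, sketched here under the assumption that the data matches the five rows shown earlier (rebuilt inline), is to wrap OneHotEncoder in a ColumnTransformer. This is an alternative to the guide's code, not the code the guide itself uses:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumption: same five rows as the guide's household_data.txt.
df = pd.DataFrame({
    "Item_Category": ["Fitness", "Fitness", "Food", "Kitchen", "Kitchen"],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
    "Age": [20, 50, 35, 22, 30],
    "Salary": [30000, 70000, 50000, 40000, 35000],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})
X = df.iloc[:, :-1]

# One-hot encode columns 0 and 1; pass Age and Salary through unchanged.
transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0, 1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = transformer.fit_transform(X)
print(X_encoded.shape)
print(X_encoded[0])
```

OneHotEncoder accepts string columns directly, so the LabelEncoder step is not needed in this variant. The encoded columns come first and the passed-through columns last, which matches the column order of the output above.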
Encoding the Dependent Vector Y
Encoding the dependent vector is much simpler than encoding the independent variables. For the dependent variable, we don't have to apply one-hot encoding; label encoding is the only encoding needed. In the code below, we apply label encoding to the dependent variable, which is 'Purchased' in our case.
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
Output
1 0 1 0 1
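Because the fitted encoder remembers its labels, numeric values can be mapped back to the original categories with inverse_transform. The sketch below assumes the 'Purchased' values from the data set above; the predicted codes are hypothetical stand-ins for a model's output:

```python
from sklearn.preprocessing import LabelEncoder

# 'Purchased' values copied from the data set above.
purchased = ["Yes", "No", "Yes", "No", "Yes"]

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(purchased)
print(y.tolist())  # [1, 0, 1, 0, 1]

# Hypothetical model output, used only to illustrate decoding.
predicted_codes = [1, 1, 0]
predicted_labels = labelencoder_y.inverse_transform(predicted_codes)
print(predicted_labels.tolist())  # ['Yes', 'Yes', 'No']
```

This is useful when the final consumer of the predictions is a person rather than another program.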
Conclusion
Understanding categorical data is one of the most important aspects of data science. People find data easy to understand when it is presented in categorical form. Computers, on the other hand, do not work easily with this kind of data, because mathematical models need numeric input. A firm understanding of the concepts required to handle categorical data is therefore a requirement when you start designing machine learning solutions. It is worth mentioning that the form of your model's output matters as much as the form of its input. If the output of your model is an input to some other data engine, then it is best to leave it in numeric form. However, if the ultimate user of the solution is a human, then you will probably want to convert the numeric data back to categories so that it is easy to make sense of.