Machine Learning Concepts with Python and scikit-learn
Dec 20, 2019 • 16 Minute Read
Introduction
This guide will discuss algorithms for supervised machine learning. These algorithms are useful when the values we want to predict are already known for the existing observations before the model is trained. For example, this guide will first look at classification. In classification, each observation in the dataset corresponds to one of a finite set of labels, or classes. The job is to (accurately) predict the appropriate label for any new observation. This supervised method is in contrast to unsupervised methods, such as clustering, where the task is to find structure, such as groups of similar observations, in a dataset that has no labels at all. Again, this guide will focus on supervised methods.
If you’d like to follow along, I recommend using Google Colab. This is a free, browser-based tool based on Jupyter notebook. To get started, all you need to do is go to https://colab.research.google.com/ in a recent browser (Google Chrome works best) and log in, if prompted, with a Gmail address or another email associated with a Google account. Once you are logged in, create a new Python 3 notebook.
In the new notebook, click the Connect link in the upper right to provision and attach the notebook to a virtual machine running on Google Compute Engine. Once connected, you’ll be able to enter Python code in the cells and press Shift-Enter or click the run button on the left of the cell to execute it. This is similar to the JavaScript console in the Chrome Developer Tools except it does much, much more!
If you’d like to learn more about Jupyter notebook, most of which is applicable to Google Colab, check out my video course in the Pluralsight catalog, “Getting Started with Jupyter Notebook and Python.”
The Concept of Classification
Let’s say we are going to teach a group of children the difference between bicycles, automobiles, and airplanes, and we are going to do this by showing them labeled pictures of each one. The idea is that the children will begin to associate characteristics of each vehicle with the corresponding labels. They might notice that bicycles have wheels but no motor, carry only a single passenger, and travel on the ground. For automobiles, they might notice vehicles that have wheels, are motorized, carry more than one passenger, and also travel on the ground. Airplanes have wheels and are motorized, but they may carry one, a few, or many passengers, and they do not travel on the ground, at least most of the time.
These characteristics of each vehicle are called features in machine learning terminology. In this case, we might zero in on four features:
- Does it have wheels?
- Is it motorized?
- How many passengers can it carry?
- Does it travel on the ground or in the air?
And in the previous paragraph, we associated a set of features with each vehicle, or label:
- Bicycle
- Automobile
- Airplane
If we showed the children enough pictures, they would be able to associate a picture of a bus or a car with the label “automobile” most of the time. And whether a twin-engine prop or a 737, they would conclude it was an airplane. But what if we showed them a picture of a skateboard? A skateboard has wheels and travels on the ground, so it’s not an airplane. But it doesn’t have a motor, so it’s not an automobile. That means, by process of elimination, it must be a bicycle, right? Obviously this is incorrect, but it illustrates an important point, and potential fault, of classification: every observation will be associated with one of the finite set of labels. There is no “none of the above.”
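To make this concrete in machine learning terms, here is a minimal sketch of how these observations might be represented as feature values paired with labels (the numbers are purely illustrative):
# Each observation is a list of feature values:
# [has_wheels, is_motorized, passenger_capacity, travels_in_air]
observations = [
    [1, 0, 1, 0],    # bicycle
    [1, 1, 5, 0],    # automobile
    [1, 1, 180, 1],  # airplane
]
# Each observation is paired with exactly one label from a finite set.
labels = ['bicycle', 'automobile', 'airplane']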
Classification in scikit-learn
The process we discussed in the previous section is obviously not how a computer would attempt classification. We described the intuitive steps a human being would take. Computers, on the other hand, operate logically and mathematically, and math is not always intuitive.
Fortunately, the Python community has provided the scikit-learn package to insulate us from much of the internal workings of machine learning algorithms (i.e., the math) so we can focus on configuring the algorithms through hyperparameters. Now, that’s a big word, but as developers, we have a useful analogy for the concept of hyperparameters.
Let’s say we are going to generate a password. Different applications and services have different requirements for passwords, such as length, diversity of the characters, or certain characters that might not be allowed. One solution is to write a class and attach properties to the class for the length of the password to be generated and so on. We simply create an instance of the class, assign values to the properties to configure the strength of the generated password, call a method on the class to generate it, and use the return value as the password. The properties of the class are conceptually similar to the hyperparameters. Basically, the hyperparameters allow us to modify the process without having to get involved with the details.
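As a minimal sketch of that analogy (the class and property names here are hypothetical, not from any real library):
import secrets
import string

class PasswordGenerator:
    def __init__(self, length=12, use_digits=True, use_symbols=False):
        # These properties configure the generation process,
        # much like hyperparameters configure a learning algorithm.
        self.length = length
        self.use_digits = use_digits
        self.use_symbols = use_symbols

    def generate(self):
        alphabet = string.ascii_letters
        if self.use_digits:
            alphabet += string.digits
        if self.use_symbols:
            alphabet += string.punctuation
        return ''.join(secrets.choice(alphabet) for _ in range(self.length))

generator = PasswordGenerator(length=16, use_symbols=True)
password = generator.generate()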
And scikit-learn implements classification algorithms as classes! All we need to do is create an instance of the class, set the hyperparameters, and call a method to train a model by providing it data to analyze. This training process is analogous to the details of password generation. We don’t need to know, or care, about how it works. We need only concern ourselves with the properties or hyperparameters to tailor the process to our specific task.
The Data Dilemma
But what about this model? A model in machine learning is a numerical and mathematical representation of the “knowledge” derived from analyzing, in the case of supervised learning, labeled observations in a dataset. It is the model that accepts new data and predicts a label. But the model must first be trained. To do this, we need data, and in a production setting, lots of it. In any case, the data needs to be cleaned, normalized, and prepared. This is a topic worthy of a whole series of guides in itself, and it’s not specific to machine learning, so I’m not going to spend a lot of time on it here. Instead, I’ll just use some of the example datasets that come with scikit-learn.
One of these is the wine dataset. To load it in Google Colab (which has already installed scikit-learn), import the load_wine() function from the sklearn.datasets module:
from sklearn.datasets import load_wine
wine_data = load_wine()
The return value of load_wine() is a dictionary-like object, so we can inspect its keys:
wine_data.keys()
The DESCR key is an explanation of the structure of the dataset. To read it, use the print() function to maintain formatting:
print(wine_data['DESCR'])
Mostly, we are going to be concerned with the data and target keys. The data key contains the values for the features. In this dataset, the features are chemical properties of various wines. To see them, look at the feature_names key in the wine_data dictionary. The target key contains the labels for the wines.
wine_data['target']
It’s all integers. To be precise, this is a representation of the targets. Scikit-learn, and machine learning in general, likes numbers more than strings. If you want to see the actual names of the targets, get the target_names key:
wine_data['target_names']
Granted, this isn’t much more descriptive. But the point is that we have three distinct labels, and each observation in the dataset corresponds to one of those three labels. Now I’ll use pandas to arrange the data in a DataFrame to make it easier to work with while training the model.
import pandas as pd
wine_data_frame = pd.DataFrame(data=wine_data['data'], columns=wine_data['feature_names'])
wine_data_frame['class'] = wine_data['target']
wine_classes = [wine_data_frame[wine_data_frame['class'] == x] for x in range(3)]
testing_data = []
for wine_class in wine_classes:
    # Pick one random observation from this class for testing
    row = wine_class.sample()
    testing_data.append(row)
    # Remove the test row so the model is never trained on it
    wine_data_frame = wine_data_frame.drop(row.index)
All I’ve done here is separate the DataFrame by class into wine_classes. Each entry in wine_classes is itself a DataFrame, from which I select a random observation using the sample() method. This way I am guaranteed to get a sample from each wine class for testing. To make sure the model is not trained on the testing data, I drop each testing row from wine_data_frame. This is not the best approach for production, but it will suffice for such a small dataset. (We’ll see a better method using scikit-learn later.)
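For reference, the scikit-learn approach hinted at above is the train_test_split() function, which can also stratify the split so that every class is represented proportionally in both sets. A minimal sketch:
from sklearn.model_selection import train_test_split

# stratify keeps the class proportions the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    wine_data['data'], wine_data['target'], test_size=0.2, stratify=wine_data['target'])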
Now let’s train the model.
Getting to Know the Neighbors
When considering algorithms for classification, scikit-learn provides about a dozen choices. Depending on the situation, some are better choices than others. But a good general-purpose algorithm is K-Nearest Neighbors. While a complete explanation of K-Nearest Neighbors (hereafter referred to as KNN) is beyond the scope of this guide, here is a brief introduction.
Let’s keep things simple and assume for the moment that the observations have two features. This way we can visualize them easily using a 2D scatter plot.
I’ll add a new, unlabeled data point to classify, shown as a large green point.
To classify this new data using KNN, the algorithm will measure the distance from the new point to a number of points closest to the new point, or neighbors. Using these measurements, it will count how many neighbors belong to each class, and the class with the highest count is the predicted label of the new point. In this case, it would predict the class represented by the orange points.
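To make the idea concrete, here is a toy sketch of that logic using plain NumPy and a majority vote; this is only an illustration, not how scikit-learn implements it internally:
import numpy as np
from collections import Counter

def knn_predict(new_point, points, labels, k=5):
    # Euclidean distance from the new point to every known point
    distances = np.linalg.norm(points - new_point, axis=1)
    # Indices of the k closest points (the "neighbors")
    nearest = np.argsort(distances)[:k]
    # The most common label among the neighbors is the prediction
    return Counter(labels[nearest]).most_common(1)[0][0]

points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
labels = np.array([0, 0, 1, 1])
knn_predict(np.array([1.2, 1.9]), points, labels, k=3)  # returns 0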
How to do this with scikit-learn? As I said before, there are classes for each classification algorithm. For KNN, the class is KNeighborsClassifier from the sklearn.neighbors module.
from sklearn.neighbors import KNeighborsClassifier
Now I can create an instance of the class.
model = KNeighborsClassifier()
And train the model using the fit() method. The parameters to fit() are the features and targets in the training dataset, which is in the wine_data_frame.
model.fit(wine_data_frame[wine_data_frame.columns[:-1]].values, wine_data_frame[wine_data_frame.columns[-1]])
Finally, using the predict() method, we can pass new data to the model and get a prediction.
model.predict(testing_data[0].values[0][:-1].reshape(1, -1))
As this returns 0, which corresponds to the target in the first testing row, the model is working well so far. I’ll leave the rest of the testing data as an exercise for the reader. In case you are wondering, the reshape(1, -1) call transforms the one-dimensional array of feature values into a two-dimensional array with a single row, which is the shape predict() expects.
You might be asking yourself, how many neighbors does the KNeighborsClassifier use to make predictions? The answer is however many you tell it to. By default, this number is 5, but you can change it with the n_neighbors= keyword argument in the KNeighborsClassifier initializer. Since this value configures the KNeighborsClassifier, it’s a hyperparameter.
model = KNeighborsClassifier(n_neighbors=3)
There are many other classification algorithms in scikit-learn, such as Naive Bayes and Support Vector Machines (SVM). Determining which is the best to use is beyond the scope of this guide. However, the process of using them is basically the same as KNeighborsClassifier:
- Create a new instance of a class
- Set the hyperparameters
- Train the model
- Make predictions
And this consistent API is the beauty of scikit-learn!
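For example, here is a sketch of the same four steps using a support vector machine; the hyperparameter values are only illustrative:
from sklearn.svm import SVC

# 1. Create an instance of the class and 2. set the hyperparameters
svm_model = SVC(kernel='rbf', C=1.0)

# 3. Train the model
svm_model.fit(wine_data_frame[wine_data_frame.columns[:-1]].values,
              wine_data_frame[wine_data_frame.columns[-1]])

# 4. Make predictions
svm_model.predict(testing_data[0].values[0][:-1].reshape(1, -1))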
Regression
Not all problems in supervised learning have a well known set of target values. In this case, the target may exist in a range of continuous values which cannot be enumerated. Such problems are referred to as regression, and scikit-learn provides classes for regression as well.
As with classification, we’ll need a dataset. The fetch_california_housing() function from the sklearn.datasets module will download a dataset containing data about housing prices in the state of California.
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
Like the wine dataset, this dictionary has keys for the data, target, and feature_names. However, it does not provide target names, since the targets are continuous values rather than a finite set of labels. Before training the model, I want to introduce another task in machine learning, and in scikit-learn: feature scaling.
Take a look at the first five observations in the data:
import numpy as np
np.set_printoptions(suppress=True)  # suppress scientific notation
housing_data['data'][:5]
As you can see, the ranges of the data are different, and this can introduce issues when it comes to accurately training the model. Feature scaling will normalize the values so they fit within a common range but still maintain the relationships to each other. Scikit-learn provides the MinMaxScaler class in the sklearn.preprocessing module that simply takes the dataset and returns the scaled values:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
X = sc.fit_transform(housing_data['data'])
y = sc.fit_transform(housing_data['target'].reshape(-1, 1))  # reshape(-1, 1) turns the 1D target array into a single column
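Under the hood, min-max scaling maps each feature column to the [0, 1] range using (x - min) / (max - min). A quick toy check:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0], [3.0], [5.0]])
MinMaxScaler().fit_transform(toy)  # array([[0. ], [0.5], [1. ]])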
The capital X is often used to represent the features in a machine learning dataset and lowercase y to represent the targets. Taking a look at the first five rows in the scaled features:
X[:5]
The ranges are now consistent. To split this dataset into training and test sets, I’ll use the train_test_split() function in the sklearn.model_selection module. It takes the features, targets, and a percentage of the data to be used for testing. The return value is a tuple with the training features, testing features, training targets, and testing targets:
from sklearn.model_selection import train_test_split
training_features, testing_features, training_targets, testing_targets = train_test_split(X, y, test_size=0.2)
Scikit-learn includes many regression classes, but I’ll use the simplest one as they all follow a similar pattern.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(training_features, training_targets)
model.predict(testing_features[0].reshape(1, -1))
testing_targets[0]
We’ve seen this movie before. The API is the same as the one used for classification. All we have to do is match the problem to the algorithm and select the appropriate class in scikit-learn. To verify the performance of the model, I’ll create a quick visualization.
predictions = model.predict(testing_features)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
ax.scatter(testing_targets, predictions)
ax.plot([testing_targets.min(), testing_targets.max()], [testing_targets.min(), testing_targets.max()], 'k--')  # reference line where prediction equals target
The dashed line represents a perfect fit. If the predictions and testing targets were the same, all of the points would be on this line. For the most part, the points are close to the line, indicating small errors and thus decent performance. It could be better, but again, these example datasets do not behave the same way as the larger, messier datasets used in production.
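For a quick quantitative check to go along with the plot, the model’s score() method returns the coefficient of determination (R²), where 1.0 would be a perfect fit:
model.score(testing_features, testing_targets)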
Conclusion
To reiterate, the main steps to using scikit-learn are:
- Match a problem to an algorithm
- Select a class which implements the algorithm
- Create an instance of the class
- Set the hyperparameters (if any; the LinearRegression class did not require any)
- Train the model with the fit() method
- Make predictions with the predict() method
- Refine the hyperparameters, rinse, and repeat to improve the performance of the model
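As a compact recap, here is a minimal sketch that strings these steps together on the wine dataset, this time letting train_test_split() hold out the test set:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine_data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine_data['data'], wine_data['target'], test_size=0.2)

model = KNeighborsClassifier(n_neighbors=3)  # set a hyperparameter
model.fit(X_train, y_train)                  # train the model
model.predict(X_test)                        # make predictions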
Using the foundation in this guide, you can begin to study the details of the classes, which include more supporting utilities and more hyperparameters. This will show just how capable scikit-learn is. There is no need to reinvent the wheel or get lost in the specifics. Scikit-learn is a fine choice for getting started in the field of machine learning. Thanks for reading this guide!