Designing a Machine Learning Model
In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package.
Jan 27, 2020 • 16 Minute Read
Introduction
In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package. Logistic regression is part of the classification technique of machine learning, which solves many problems in data science.
Logistic regression is also one of the simplest and most commonly used models. It provides a baseline for any binary classification problem, where the outcome is either True/False or Yes/No. It can predict, for example, whether an email is spam or whether an individual has diabetes, but it can't predict continuous values such as house prices. The other category is multinomial classification, in which more than two classes are distinguished, such as whether the weather will be sunny, rainy, or humid, or which species an animal belongs to.
In this guide, we will explore a known multi-class problem using the iris dataset from the UCI Machine Learning Repository to predict the species of flowers based on their given dimensions.
You can download the dataset here. Please note that you will need a Kaggle account to get the dataset.
By the end of this guide ...
- You will be able to draw insights from the dataset by visualizing it. Visualization is like telling a story from the data, and it makes analysis more efficient.
- You will build an ML algorithm from scratch by converting mathematical steps into running code. It will also become easier to understand the mechanics behind the algorithm.
- You will successfully design a logistic regression machine learning model that you can showcase on different data science platforms.
So let us begin our journey!
Introduction to the Dataset
The iris dataset contains observations of three iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 50 observations of each species, for a total of 150 observations with four features each (sepal length, sepal width, petal length, petal width). Based on the combination of these four features, statistician R. A. Fisher developed a linear discriminant model to distinguish the species from each other.
Before we start, we need to define some ML terms.
- Attributes (features): An attribute is a measurable property of the observations that is used to determine the classification. In this case, the attributes are the length and width of the petals and sepals.
- Target variable: In the context of ML, this is the variable that the model should predict as output. Here the target variable is the flower species, which takes three values (see the short sketch after this list).
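As a small sketch, these two terms map directly onto columns of the CSV file we load below; the variable names here are just for illustration.

# Attributes (features) and the target variable, as named in Iris.csv
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
target_column = 'Species'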
# import libraries
from subprocess import check_output
import numpy as np # linear algebra
import pandas as pd # data processing
import warnings
warnings.filterwarnings('ignore') #ignore warnings
from math import ceil
#Visualization
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.metrics import confusion_matrix #Confusion matrix
from sklearn.metrics import accuracy_score # Accuracy score
# Splitting training and testing data
from sklearn.model_selection import train_test_split
#Advanced optimization
from scipy import optimize as op
Loading and Previewing the Data
# Loading the data
data_iris = pd.read_csv('Iris.csv') # if Iris.csv is in the working directory
# data_iris = pd.read_csv('../input/Iris.csv') # otherwise, give the path to the directory where the CSV is stored
data_iris.head() # first 5 entries of the dataset
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---|---
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa
data_iris.tail() # last 5 entries of dataset
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---|---
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica
data_iris.info()
The following command gives the descriptive statistics (percentiles, mean, standard deviation) for all 150 observations.
data_iris.describe()
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm
---|---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000
mean | 75.500000 | 5.843333 | 3.054000 | 3.758667 | 1.198667
std | 43.445368 | 0.828066 | 0.433594 | 1.764420 | 0.763161
min | 1.000000 | 4.300000 | 2.000000 | 1.000000 | 0.100000
25% | 38.250000 | 5.100000 | 2.800000 | 1.600000 | 0.300000
50% | 75.500000 | 5.800000 | 3.000000 | 4.350000 | 1.300000
75% | 112.750000 | 6.400000 | 3.300000 | 5.100000 | 1.800000
max | 150.000000 | 7.900000 | 4.400000 | 6.900000 | 2.500000
data_iris['Species'].value_counts()
Iris-setosa 50
Iris-virginica 50
Iris-versicolor 50
Name: Species, dtype: int64
Check for missing values.
data_iris.isnull().values.any()
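If you also want a per-column breakdown, a quick sketch:

# Count missing values in each column (should be zero everywhere)
print(data_iris.isnull().sum())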
Great! There is no missing data. Now let's describe the statistical details for each flower species.
print('Iris-setosa')
setosa = data_iris['Species'] == 'Iris-setosa'
print(data_iris[setosa].describe())
print('\nIris-versicolor')
versicolor = data_iris['Species'] == 'Iris-versicolor'
print(data_iris[versicolor].describe())
print('\nIris-virginica')
virginica = data_iris['Species'] == 'Iris-virginica'
print(data_iris[virginica].describe())
Visualizations
# Histograms of each measurement (univariate plots)
iris_features = data_iris.drop('Id', axis=1) # drop the Id column so it is not plotted
iris_features.hist(edgecolor='blue', linewidth=1.2)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
# Plotting a scatter plot of petal length vs. petal width
petalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "PetalLengthCm", "PetalWidthCm")
plt.legend(loc='upper left');
plt.title("Petal Length vs. Width");
# Plotting a scatter plot of sepal length vs. sepal width
sepalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "SepalLengthCm", "SepalWidthCm")
plt.legend(loc='upper right');
plt.title("Sepal Length vs. Width")
Here we can see that the petal features are giving a better cluster division. Let's check the bivariate relation between each pair of features.
# Using the seaborn pairplot to see the bivariate relation between each pair of features
sb.set_palette('husl')
nl = data_iris.drop('Id', axis=1) # dropping the Id
b = sb.pairplot(nl, hue="Species", diag_kind="kde", markers='+', height=3);
plt.show()
Takeaways from the above visualization:
- In the above plots, the diagonal grouping of the pairs of attributes suggests a high correlation and a predictable relationship (quantified in the quick check below).
- The relationships between pairs of features of Iris-setosa (in pink) are distinctly different from those of the other two species.
- There is an overlap in the pairwise relationships of the other two species, Iris-versicolor (brown) and Iris-virginica (green).
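To quantify what the pair plot suggests, here is a minimal sketch of a correlation heat map over the numeric measurement columns; a high positive or negative value indicates a strong linear relationship between two features.

# Correlation between the four measurement columns (Id and Species excluded)
corr = data_iris.drop(['Id', 'Species'], axis=1).corr()
sb.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature correlation')
plt.show()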
# Data setup
Species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
# Number of examples
m = data_iris.shape[0]
# Number of features
n = 4
# Number of classes
k = 3
# Feature matrix with an extra first column of ones for the intercept term
X = np.ones((m, n + 1))
X[:,1] = data_iris['PetalLengthCm'].values
X[:,2] = data_iris['PetalWidthCm'].values
X[:,3] = data_iris['SepalLengthCm'].values
X[:,4] = data_iris['SepalWidthCm'].values
# Labels
y = data_iris['Species'].values
# Mean-center the feature columns (skip the intercept column)
for j in range(1, n + 1):
    X[:, j] = X[:, j] - X[:, j].mean()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11)
# 80% of the data is used for training and 20% for testing.
print(X_train.shape)
print(y_test.shape)
(120, 5)
(30,)
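As an optional sanity check (a quick sketch), we can confirm that the split keeps the three species roughly balanced in the training set:

# Class counts in the training labels
print(pd.Series(y_train).value_counts())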
Logistic Regression from Scratch
Picking Up the Link Function
Logistic regression uses the sigmoid (logistic) function as its link function. It maps any real-valued input z to a value between 0 and 1, which can be interpreted as a probability: sigmoid(z) = 1 / (1 + e^(-z)). The hypothesis is obtained by applying the sigmoid to a linear combination of the features, h(x) = sigmoid(θᵀx), and the predicted class is the one with the highest probability.
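A minimal sketch to visualize the S-shaped curve of this link function (the plot is illustrative and not required for the model):

# Plot the sigmoid curve over a range of inputs
z = np.linspace(-10, 10, 200)
plt.plot(z, 1.0 / (1 + np.exp(-z)))
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Logistic (sigmoid) link function')
plt.show()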
Regularization
Regularization addresses the problem of over-fitting or under-fitting the model while training on the data. Over-fitting occurs when the variance is high: the model is overly complicated, with lots of unnecessary curves and angles, so it fits the training data very closely but performs poorly on the test data. Under-fitting occurs when the bias is high: the model is too simple, so it fails to capture the underlying pattern and performs poorly on both the training and the test data. When we apply regularization, the model shrinks the parameters θj toward zero without discarding any features. The regularization term is added at the end of the cost function and gradient equations.
Regularized Cost Function
J(θ) = (1/m) · [ −yᵀ log(h) − (1 − y)ᵀ log(1 − h) ] + (λ / 2m) · Σ θj², where h = sigmoid(Xθ) and λ is the regularization parameter.
Regularized Gradient
∇J(θ) = (1/m) · Xᵀ(h − y) + (λ/m) · θ
There are multiple ways to translate mathematical equations into code. Choose the method you are comfortable with, and make sure it faithfully reproduces the mathematical expression.
Putting it All Together
# Sigmoid function
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

#____________________________________________________________________________#
# Regularized cost function
def reglrCostFunction(theta, X, y, lambda_s = 0.1):
    m = len(y)
    h = sigmoid(X.dot(theta))
    J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    reg = (lambda_s / (2 * m)) * np.sum(theta**2)
    J = J + reg
    return J

#____________________________________________________________________________#
# Regularized gradient function
def reglrGradient(theta, X, y, lambda_s = 0.1):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    reg = lambda_s * theta / m
    gd = ((1 / m) * X.T.dot(h - y))
    gd = gd + reg
    return gd

#____________________________________________________________________________#
# Optimizing logistic regression (theta)
def logisticRegression(X, y, theta):
    result = op.minimize(fun = reglrCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = reglrGradient)
    return result.x
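Before training, one quick way to make sure the gradient code really matches the cost function is a finite-difference check. This is just an optional sanity-check sketch using the functions above; the variable names are for illustration only.

# Optional sanity check: compare the analytic gradient to a numerical approximation
theta_chk = 0.1 * np.random.RandomState(0).randn(n + 1)
y_chk = np.array(y_train == Species[0], dtype = int)
eps = 1e-5
num_grad = np.zeros(n + 1)
for j in range(n + 1):
    step = np.zeros(n + 1)
    step[j] = eps
    num_grad[j] = (reglrCostFunction(theta_chk + step, X_train, y_chk) -
                   reglrCostFunction(theta_chk - step, X_train, y_chk)) / (2 * eps)
# The two gradients should agree closely if the math was translated correctly
print(np.allclose(num_grad, reglrGradient(theta_chk, X_train, y_chk).flatten(), atol = 1e-4))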
# Training
all_theta = np.zeros((k, n + 1))

# One vs. all: fit one binary classifier per species
i = 0
for flower in Species:
    tmp_y = np.array(y_train == flower, dtype = int)
    optTheta = logisticRegression(X_train, tmp_y, np.zeros((n + 1, 1)))
    all_theta[i] = optTheta
    i += 1

# Predictions
Prob = sigmoid(X_test.dot(all_theta.T)) # probability of each class for every test example
pred = [Species[np.argmax(Prob[i, :])] for i in range(X_test.shape[0])]
print(" Test Accuracy ", accuracy_score(y_test, pred) * 100 , '%')
Test Accuracy 96.66666666666667 %
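It can also be instructive to peek at the one-vs-all probabilities for a single test example (a quick sketch):

# Class probabilities and prediction for the first test example
print(Prob[0, :])                      # probability of each species
print(pred[0], '| actual:', y_test[0])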
Our model is performing well, as the accuracy is approximately 97%. Let's look at the confusion matrix as a heat map to see, for each species, how many test examples were classified correctly and where the model confused one class with another.
# Confusion Matrix
cnfm = confusion_matrix(y_test, pred, labels = Species)
sb.heatmap(cnfm, annot = True, xticklabels = Species, yticklabels = Species);
# classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
Logistic Regression with Scikit-Learn
Our math looks fine. It is pretty clear that our functions are working, and the output is as expected. Still, it never hurts to verify. Let's check whether we get the same results with the scikit-learn logistic regression package.
Steps to Apply the Algorithm
- After splitting data into training and testing datasets (consider the above train and test variables), select an algorithm based on the problem.
- To fit the model, pass the training dataset to the algorithm using the .fit() method.
- To predict the outcome, the testing data is passed to the trained algorithm using the .predict() method.
- To check the accuracy, pass the predicted and actual outcomes to an evaluation metric such as accuracy_score().
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Test Accuracy for Scikit-Learn model:', metrics.accuracy_score(y_test, y_pred)* 100,'%')
Test Accuracy for Scikit-Learn model: 96.66666666666667 %
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
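For an apples-to-apples comparison, you can also put the two confusion matrices side by side (a quick sketch reusing pred from the from-scratch model and y_pred from scikit-learn):

# Confusion matrices: from-scratch model vs. scikit-learn model
print(confusion_matrix(y_test, pred, labels = Species))    # from-scratch predictions
print(confusion_matrix(y_test, y_pred, labels = Species))  # scikit-learn predictions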
The accuracy of the from-scratch logistic regression model and the scikit-learn model is a match! (Which makes sense.)
Conclusion
Although we achieved an accuracy of nearly 97%, there is still room for improvement. In production, it is usually preferable to use the scikit-learn logistic regression model, as it requires far less code and relies on a highly optimized solver, which reduces processing time.
So, when does it come in handy to build algorithms from scratch? When you have to design a model for a more complex problem or a problem in a new domain.
For this guide, we selected a simple dataset with few features. Just keep in mind that the performance and the choice of techniques depend on the volume and variety of the data.
One final tip: don't worry if your algorithm is not as optimized and polished as the established packages. Those packages are the result of a strong foundation and years of consistent improvement.
I hope you liked this guide. If you have any queries, feel free to contact me at CodeAlphabet.