Designing a Machine Learning Model
In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package.
Jan 27, 2020 • 16 Minute Read
Introduction
In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package. Logistic regression is part of the classification technique of machine learning, which solves many problems in data science.
Logistic regression is also one of the simplest and most commonly used models. It provides a baseline for any binary classification problem, where the outcome is either True/False or Yes/No. It can predict, for example, whether an email is spam or whether an individual has diabetes, but it can't predict continuous values such as house prices. The other category is multinomial classification, in which more than two classes are distinguished, such as whether the weather will be sunny, rainy, or humid, or which species an animal belongs to.
In this guide, we will explore a known multi-class problem using the iris dataset from the UCI Machine Learning Repository to predict the species of flowers based on their given dimensions.
You can download the dataset here. Please note that you will need a Kaggle account to get the dataset.
By the end of this guide ...
- You will be able to draw insights from the dataset by visualizing it. Visualization is like telling a story from the data, and it makes analysis more efficient.
- You will build an ML algorithm from scratch by converting mathematical steps into running code. It will also become easier to understand the mechanics behind the algorithm.
- You will successfully design a logistic regression machine learning model that you can showcase on different data science platforms.
So let us begin our journey!
Introduction to the Dataset
The iris dataset contains observations of three iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 50 observations of each species, for a total of 150 observations with four features each (sepal length, sepal width, petal length, petal width). Based on the combination of these four features, statistician R. A. Fisher developed a linear discriminant model to distinguish the species from each other.
Before we start, we need to define some ML terms.
- Attributes (features): An attribute is a measurable property of the observations that is used to determine the classification. In this case, the attributes are the length and width of the petals and sepals.
- Target variable: In the context of ML, this is the variable that the model should predict as output. Here the target variable is the flower species, which takes three values (see the short sketch after this list).
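As a small sketch, these two terms map directly onto columns of the CSV file we load below; the variable names here are just for illustration.

# Attributes (features) and the target variable, as named in Iris.csv
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
target_column = 'Species'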
# import libraries
from subprocess import check_output
import numpy as np # linear algebra
import pandas as pd # data processing
import warnings
warnings.filterwarnings('ignore') #ignore warnings
from math import ceil
#Visualization
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.metrics import confusion_matrix #Confusion matrix
from sklearn.metrics import accuracy_score # Accuracy score
# Splitting training and testing data
from sklearn.model_selection import train_test_split
#Advanced optimization
from scipy import optimize as op
Loading and Previewing the Data
# Loading the data
data_iris = pd.read_csv('Iris.csv') # if Iris.csv is in the working directory
# data_iris = pd.read_csv('../input/Iris.csv') # otherwise, give the path to the directory where the CSV is stored
data_iris.head() # first 5 entries of the dataset
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---|---
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa
data_iris.tail() # last 5 entries of dataset
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species
---|---|---|---|---|---|---
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica
data_iris.info()
The following command gives the descriptive statistics (percentiles, mean, standard deviation) for all 150 observations.
data_iris.describe()
  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm
---|---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000
mean | 75.500000 | 5.843333 | 3.054000 | 3.758667 | 1.198667
std | 43.445368 | 0.828066 | 0.433594 | 1.764420 | 0.763161
min | 1.000000 | 4.300000 | 2.000000 | 1.000000 | 0.100000
25% | 38.250000 | 5.100000 | 2.800000 | 1.600000 | 0.300000
50% | 75.500000 | 5.800000 | 3.000000 | 4.350000 | 1.300000
75% | 112.750000 | 6.400000 | 3.300000 | 5.100000 | 1.800000
max | 150.000000 | 7.900000 | 4.400000 | 6.900000 | 2.500000
data_iris['Species'].value_counts()
Iris-setosa 50
Iris-virginica 50
Iris-versicolor 50
Name: Species, dtype: int64
Check for missing values.
data_iris.isnull().values.any()
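If you also want a per-column breakdown, a quick sketch:

# Count missing values in each column (should be zero everywhere)
print(data_iris.isnull().sum())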
Great! There is no missing data. Now let's describe the statistical details for each flower species.
print('Iris-setosa')
setosa = data_iris['Species'] == 'Iris-setosa'
print(data_iris[setosa].describe())
print('\nIris-versicolor')
versicolor = data_iris['Species'] == 'Iris-versicolor'
print(data_iris[versicolor].describe())
print('\nIris-virginica')
virginica = data_iris['Species'] == 'Iris-virginica'
print(data_iris[virginica].describe())
Visualizations
# Histograms of each measurement (univariate plots)
iris_features = data_iris.drop('Id', axis=1) # drop the Id column so it is not plotted
iris_features.hist(edgecolor='blue', linewidth=1.2)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
# Plotting a scatter plot of petal length vs. petal width
petalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "PetalLengthCm", "PetalWidthCm")
plt.legend(loc='upper left');
plt.title("Petal Length vs. Width");
# Plotting a scatter plot of sepal length vs. sepal width
sepalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "SepalLengthCm", "SepalWidthCm")
plt.legend(loc='upper right');
plt.title("Sepal Length vs. Width")
Here we can see that the petal features are giving a better cluster division. Let's check the bivariate relation between each pair of features.
# Using the seaborn pairplot to see the bivariate relation between each pair of features
sb.set_palette('husl')
nl = data_iris.drop('Id', axis=1) # dropping the Id
b = sb.pairplot(nl, hue="Species", diag_kind="kde", markers='+', height=3);
plt.show()
Takeaways from the above visualization:
- In the above plots, the diagonal grouping of the pairs of attributes suggests a high correlation and a predictable relationship (quantified in the quick check below).
- The relationships between pairs of features of Iris-setosa (in pink) are distinctly different from those of the other two species.
- There is an overlap in the pairwise relationships of the other two species, Iris-versicolor (brown) and Iris-virginica (green).
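To quantify what the pair plot suggests, here is a minimal sketch of a correlation heat map over the numeric measurement columns; a high positive or negative value indicates a strong linear relationship between two features.

# Correlation between the four measurement columns (Id and Species excluded)
corr = data_iris.drop(['Id', 'Species'], axis=1).corr()
sb.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature correlation')
plt.show()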
# Data setup
Species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
# Number of examples
m = data_iris.shape[0]
# Number of features
n = 4
# Number of classes
k = 3
# Feature matrix with an extra first column of ones for the intercept term
X = np.ones((m, n + 1))
X[:,1] = data_iris['PetalLengthCm'].values
X[:,2] = data_iris['PetalWidthCm'].values
X[:,3] = data_iris['SepalLengthCm'].values
X[:,4] = data_iris['SepalWidthCm'].values
# Labels
y = data_iris['Species'].values
# Mean-center the feature columns (skip the intercept column)
for j in range(1, n + 1):
    X[:, j] = X[:, j] - X[:, j].mean()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11)
# 80% of the data is used for training and 20% for testing.
print(X_train.shape)
print(y_test.shape)
(120, 5)
(30,)
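As an optional sanity check (a quick sketch), we can confirm that the split keeps the three species roughly balanced in the training set:

# Class counts in the training labels
print(pd.Series(y_train).value_counts())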
Logistic Regression from Scratch
Picking Up the Link Function
Logistic regression uses the sigmoid (logistic) function as its link function. It maps any real-valued input z to a value between 0 and 1, which can be interpreted as a probability: sigmoid(z) = 1 / (1 + e^(-z)). The hypothesis is obtained by applying the sigmoid to a linear combination of the features, h(x) = sigmoid(θᵀx), and the predicted class is the one with the highest probability.
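A minimal sketch to visualize the S-shaped curve of this link function (the plot is illustrative and not required for the model):

# Plot the sigmoid curve over a range of inputs
z = np.linspace(-10, 10, 200)
plt.plot(z, 1.0 / (1 + np.exp(-z)))
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Logistic (sigmoid) link function')
plt.show()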
Regularization
Regularization addresses the problem of over-fitting or under-fitting the model while training on the data. Over-fitting occurs when the variance is high: the model is overly complicated, with lots of unnecessary curves and angles, so it fits the training data very closely but performs poorly on the test data. Under-fitting occurs when the bias is high: the model is too simple, so it fails to capture the underlying pattern and performs poorly on both the training and the test data. When we apply regularization, the model shrinks the parameters θj toward zero without discarding any features. The regularization term is added at the end of the cost function and gradient equations.
Regularized Cost Function
J(θ) = (1/m) · [ −yᵀ log(h) − (1 − y)ᵀ log(1 − h) ] + (λ / 2m) · Σ θj², where h = sigmoid(Xθ) and λ is the regularization parameter.
Regularized Gradient
∇J(θ) = (1/m) · Xᵀ(h − y) + (λ/m) · θ
There are multiple ways to translate mathematical equations into code. Choose the method you are comfortable with, and make sure it faithfully reproduces the mathematical expression.
Putting it All Together
# Sigmoid function
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

#____________________________________________________________________________#
# Regularized cost function
def reglrCostFunction(theta, X, y, lambda_s = 0.1):
    m = len(y)
    h = sigmoid(X.dot(theta))
    J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    reg = (lambda_s / (2 * m)) * np.sum(theta**2)
    J = J + reg
    return J

#____________________________________________________________________________#
# Regularized gradient function
def reglrGradient(theta, X, y, lambda_s = 0.1):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    reg = lambda_s * theta / m
    gd = ((1 / m) * X.T.dot(h - y))
    gd = gd + reg
    return gd

#____________________________________________________________________________#
# Optimizing logistic regression (theta)
def logisticRegression(X, y, theta):
    result = op.minimize(fun = reglrCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = reglrGradient)
    return result.x
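Before training, one quick way to make sure the gradient code really matches the cost function is a finite-difference check. This is just an optional sanity-check sketch using the functions above; the variable names are for illustration only.

# Optional sanity check: compare the analytic gradient to a numerical approximation
theta_chk = 0.1 * np.random.RandomState(0).randn(n + 1)
y_chk = np.array(y_train == Species[0], dtype = int)
eps = 1e-5
num_grad = np.zeros(n + 1)
for j in range(n + 1):
    step = np.zeros(n + 1)
    step[j] = eps
    num_grad[j] = (reglrCostFunction(theta_chk + step, X_train, y_chk) -
                   reglrCostFunction(theta_chk - step, X_train, y_chk)) / (2 * eps)
# The two gradients should agree closely if the math was translated correctly
print(np.allclose(num_grad, reglrGradient(theta_chk, X_train, y_chk).flatten(), atol = 1e-4))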
# Training
all_theta = np.zeros((k, n + 1))

# One vs. all: fit one binary classifier per species
i = 0
for flower in Species:
    tmp_y = np.array(y_train == flower, dtype = int)
    optTheta = logisticRegression(X_train, tmp_y, np.zeros((n + 1, 1)))
    all_theta[i] = optTheta
    i += 1

# Predictions
Prob = sigmoid(X_test.dot(all_theta.T)) # probability of each class for every test example
pred = [Species[np.argmax(Prob[i, :])] for i in range(X_test.shape[0])]
print(" Test Accuracy ", accuracy_score(y_test, pred) * 100 , '%')
Test Accuracy 96.66666666666667 %
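It can also be instructive to peek at the one-vs-all probabilities for a single test example (a quick sketch):

# Class probabilities and prediction for the first test example
print(Prob[0, :])                      # probability of each species
print(pred[0], '| actual:', y_test[0])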
Our model is performing well, as the accuracy is approximately 97%. Let's look at the confusion matrix as a heat map to see, for each species, how many test examples were classified correctly and where the model confused one class with another.
# Confusion Matrix
cnfm = confusion_matrix(y_test, pred, labels = Species)
sb.heatmap(cnfm, annot = True, xticklabels = Species, yticklabels = Species);
# classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
Logistic Regression with Scikit-Learn
Our math looks fine. It is pretty clear that our functions are working, and the output is as expected. Still, it never hurts to verify. Let's check whether we get the same results with the scikit-learn logistic regression package.
Steps to Apply the Algorithm
- After splitting data into training and testing datasets (consider the above train and test variables), select an algorithm based on the problem.
- To fit the model, pass the training dataset to the algorithm using the .fit() method.
- To predict the outcome, the testing data is passed to the trained algorithm using the .predict() method.
- To check the accuracy, pass the predicted and actual outcomes to an evaluation metric such as accuracy_score().
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Test Accuracy for Scikit-Learn model:', metrics.accuracy_score(y_test, y_pred)* 100,'%')
Test Accuracy for Scikit-Learn model: 96.66666666666667 %
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
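For an apples-to-apples comparison, you can also put the two confusion matrices side by side (a quick sketch reusing pred from the from-scratch model and y_pred from scikit-learn):

# Confusion matrices: from-scratch model vs. scikit-learn model
print(confusion_matrix(y_test, pred, labels = Species))    # from-scratch predictions
print(confusion_matrix(y_test, y_pred, labels = Species))  # scikit-learn predictions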
The accuracy of the from-scratch logistic regression model and the scikit-learn model is a match! (Which makes sense.)
Conclusion
Although we achieved an accuracy of nearly 97%, there is still room for improvement. In production, it is usually preferable to use the scikit-learn logistic regression model, as it requires far less code and relies on a highly optimized solver, which reduces processing time.
So, when does it come in handy to build algorithms from scratch? When you have to design a model for a more complex problem or a problem in a new domain.
For this guide, we selected a simple dataset with few features. Just keep in mind that the performance and the choice of techniques depend on the volume and variety of the data.
One final tip: don't worry if your algorithm is not as optimized and polished as the established packages. Those packages are the result of a strong foundation and years of consistent improvement.
I hope you liked this guide. If you have any queries, feel free to contact me at CodeAlphabet.