Data Visualization for Deep Learning Model Using Matplotlib
Learn how to visualize your data using the Matplotlib library to make informed decisions and improve your machine learning model.
Jun 14, 2019 • 8 Minute Read
Introduction
Visualizing the performance of a machine learning model is an easy way to make sense of the data coming out of the model and to make informed decisions about changes to the parameters or hyperparameters that affect it. In this guide, we are going to learn how to visualize this data using the Matplotlib library and integrate it with a deep learning model to make informed decisions and improve the model.
Reasons to Visualize
- To evaluate underfitting or overfitting: One of the primary difficulties in any machine learning approach is making the model generalize, so that it produces reasonable predictions on new data and not just on the data it has already been trained on. Visualizing the training loss vs. validation loss or training accuracy vs. validation accuracy over a number of epochs is a good way to determine whether the model has been sufficiently trained. This matters because an undertrained model has not yet learned the patterns in the data, while an overtrained model starts memorizing the training data, which in turn reduces its ability to predict accurately on new data.
- To adjust the hyperparameters: Hyperparameters such as the number of nodes per layer of the neural network and the number of layers in the network can have a significant impact on the performance of the model. Visualizing how well the model fits the training and validation sets can help to optimize these values and build a better model.
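As a rough illustration of the first point, the sketch below reads two loss curves and reports the epoch with the lowest validation loss, along with the gap between validation and training loss. The helper names `best_epoch` and `generalization_gap` and the loss values are hypothetical, invented here for illustration; they are not part of Keras or Matplotlib.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch at which validation loss was lowest."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


def generalization_gap(train_loss, val_loss):
    """Return validation loss minus training loss for each epoch."""
    return [v - t for t, v in zip(train_loss, val_loss)]


# Hypothetical per-epoch losses, made up for illustration only.
train_loss = [0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_loss = [0.62, 0.50, 0.42, 0.40, 0.43, 0.47]

gaps = generalization_gap(train_loss, val_loss)
print("Lowest validation loss at epoch", best_epoch(val_loss))
print("Gap at the final epoch:", round(gaps[-1], 2))
```

In this made-up example the training loss keeps falling while the validation loss turns upward after its minimum, which is the pattern usually read as the onset of overfitting.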
Matplotlib to Generate the Graphs
We are going to import the data from a .csv file and then split it into three sets: train, validation, and test. The training data will be used to train the model, while the validation set will be used to check the fitness of the model. After each run, users can adjust hyperparameters such as the number of layers in the network, the number of nodes per layer, the number of epochs, etc. These adjustments are mostly made on a trial-and-error basis, and visualization tools, such as the plots produced by Matplotlib, help in reaching desirable results. The test set must not be involved in training at the parameter or hyperparameter level. If the user intentionally or unintentionally uses the test data for training, the test set can no longer give an accurate estimate of the model's generalization power.
The below program builds the Deep Learning Model for Binary Classification. The data is split into three sets:
- Training set
- Validation set
- Test set
The original data set is split such that 20% of the entire data is assigned to the test set and the rest remains as the training set. The training set is then split again such that 20% of it is assigned to the validation set and the rest is used for training. Of the entire data set, 64% is therefore used as the training set, 16% as the validation set, and 20% as the test set. The training data is fed to a three-layer neural network in which the first two layers have four nodes each and the output layer has just one node. The loss and accuracy of the model for each epoch are stored in the history object.
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('C:\\ml\\molecular_activity.csv')
properties = list(df.columns.values)
properties.remove('Activity')
print(properties)
X = df[properties]
y = df['Activity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(4,)),
    keras.layers.Dense(4, activation=tf.nn.relu),
    keras.layers.Dense(4, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])
model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=34, batch_size=1, validation_data=(X_val, y_val))
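The 64% / 16% / 20% proportions described above follow from applying a 20% split twice. A minimal sketch of that arithmetic, using a hypothetical helper named `split_fractions` that is not part of scikit-learn:

```python
def split_fractions(test_size=0.2, val_size=0.2):
    """Return (train, validation, test) fractions of the full data set.

    test_size is taken from the full data set; val_size is taken from
    what remains after the test set has been removed.
    """
    remaining = 1.0 - test_size      # 80% left after the first split
    val = remaining * val_size       # 20% of that 80%
    train = remaining - val          # whatever is left is used for training
    return train, val, test_size


train, val, test = split_fractions()
print(f"train={train:.0%}, validation={val:.0%}, test={test:.0%}")
# prints: train=64%, validation=16%, test=20%
```

On a real data set the row counts are rounded to whole samples, so the actual proportions may differ slightly from these fractions.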
The below snippet plots the graph of the training loss vs. validation loss over the number of epochs. This will help the developer of the model to make informed decisions about the architectural choices that need to be made.
loss_train = history.history['loss']  # Keras stores the training loss under 'loss'
loss_val = history.history['val_loss']
epochs = range(1, len(loss_train) + 1)  # one point per completed epoch
plt.plot(epochs, loss_train, 'g', label='Training loss')
plt.plot(epochs, loss_val, 'b', label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Output
The following snippet plots the graph of training accuracy vs. validation accuracy over the number of epochs.
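# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc_train = history.history[acc_key]
acc_val = history.history['val_' + acc_key]
epochs = range(1, len(acc_train) + 1)  # one point per completed epoch
plt.plot(epochs, acc_train, 'g', label='Training accuracy')
plt.plot(epochs, acc_val, 'b', label='Validation accuracy')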
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Output
Epoch 1/10
345/345 [==============================] - 1s 2ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 2/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 3/10
345/345 [==============================] - 1s 2ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 4/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 5/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 6/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 7/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 8/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 9/10
345/345 [==============================] - 0s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
Epoch 10/10
345/345 [==============================] - 1s 1ms/sample - loss: 0.3043 - acc: 0.6957 - val_loss: 0.3563 - val_acc: 0.6437
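The two plots above can also be drawn side by side in a single figure, which makes it easier to compare loss and accuracy at the same epoch. The sketch below is one possible way to do that. The helper `plot_history` is a hypothetical name, and the dictionary simply repeats the values from the training log above in the same shape as `history.history`; the `'acc'` key is an assumption that matches this log and may be `'accuracy'` in other Keras versions.

```python
import matplotlib.pyplot as plt


def plot_history(history, metric="acc"):
    """Plot loss and one metric for training and validation side by side.

    `history` is a dict shaped like the `history.history` attribute in Keras.
    """
    epochs = range(1, len(history["loss"]) + 1)
    fig, (ax_loss, ax_metric) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: training vs. validation loss
    ax_loss.plot(epochs, history["loss"], "g", label="Training loss")
    ax_loss.plot(epochs, history["val_loss"], "b", label="Validation loss")
    ax_loss.set_title("Training and Validation loss")
    ax_loss.set_xlabel("Epochs")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    # Right panel: training vs. validation accuracy
    ax_metric.plot(epochs, history[metric], "g", label="Training accuracy")
    ax_metric.plot(epochs, history["val_" + metric], "b", label="Validation accuracy")
    ax_metric.set_title("Training and Validation accuracy")
    ax_metric.set_xlabel("Epochs")
    ax_metric.set_ylabel("Accuracy")
    ax_metric.legend()

    fig.tight_layout()
    return fig, (ax_loss, ax_metric)


# Values copied from the 10-epoch training log above
logged = {
    "loss": [0.3043] * 10,
    "val_loss": [0.3563] * 10,
    "acc": [0.6957] * 10,
    "val_acc": [0.6437] * 10,
}
fig, axes = plot_history(logged)
```

Calling `plt.show()` afterwards displays the figure. With a real model, `plot_history(history.history)` would be the equivalent call.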
Conclusion
Visualizing data is one of the best ways to humanize it, making it easier to understand and to pick out the relevant trends. This can be crucial when the user is still trying to optimize the model and make it production ready. The Matplotlib library offers many different tools to help in this visualization process. Users can choose to create graphs such as line plots, histograms, three-dimensional plots, streamplots, bar charts, pie charts, tables, scatter plots, etc., depending on the demands of the problem at hand.
Appendix
I have compiled the complete data set, which can be found on my GitHub.