Text Generation Using Recurrent Neural Networks
Dec 5, 2019 • 14 Minute Read
Introduction
Text generation is one of the defining problems of natural language processing (NLP), in which computer algorithms try to make sense of free-form text in a given language or to generate similar text from training examples. Text generation has been notoriously difficult for shallow learning techniques, but deep learning algorithms, and especially recurrent neural networks (RNNs), have brought new vigor to the field after decades of stagnation.
In this guide, we will learn the basics of text generation with recurrent neural networks in Python, using Paradise Lost by John Milton as the training text. The book is freely available from Project Gutenberg, which houses some of the classics of world literature.
Recurrent Neural Networks (RNNs)
The deep networks used for image classification (convnets) and for structured data (densely connected networks) take their input all at once, with no memory attached to it; they are essentially feedforward networks. The whole dataset is converted to tensors and fed to the network, which in turn adjusts its weights and biases to fit the training data and make reasonable predictions.
RNNs, on the other hand, keep a memory of the sequence seen so far in order to guess what is coming next. This, in principle, makes the network better able to adapt to data that depends on earlier context. For example, the word "normal" can have different meanings in different contexts. If the text involves a sample and statistics, then "normal" is likely to be followed by the word "distribution." If the word shows up in a sentence with geometrical references, it may mean "perpendicular." And if "normal" appears in a chemistry context, it is likely to be followed by the word "solution." RNNs are capable of picking up these kinds of context-specific nuances and producing the desired results. One variant of the RNN is the long short-term memory (LSTM) network, which is what we are going to use in this guide.
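To make the contrast with feedforward layers concrete, here is a minimal, self-contained sketch (illustrative only, assuming TensorFlow 2.x with eager execution; it is not part of the guide's main script) showing how a Keras LSTM layer consumes a whole sequence, one timestep at a time, and condenses it into a single state vector:
import numpy as np
from tensorflow import keras
# a toy batch: 4 sequences, 5 timesteps each, 1 feature per timestep
toy_batch = np.random.rand(4, 5, 1).astype("float32")
# an LSTM layer reads the 5 timesteps in order, carrying a hidden state
# from one timestep to the next; a Dense layer has no such memory
lstm_layer = keras.layers.LSTM(8)  # 8 hidden units
sequence_summary = lstm_layer(toy_batch)
# one 8-dimensional vector per sequence, summarizing everything seen so far
print(sequence_summary.shape)  # (4, 8)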
Text Preprocessing
Before the data can be fed to the neural network, it needs to be processed into a form that deep learning models can accept. The code block below reads the book Paradise Lost and converts all of its characters to lowercase (text_1). It then tokenizes the text so that each character is assigned a unique integer (char_index). These integers are what will eventually be fed to the network in the form of tensors. A reverse mapping (index_char) is also created, which will be used to turn the integers produced by the output layer back into text. One thing to note is that we are using character-level encoding; word-level encoding is the other common technique (a short word-level example follows the first output below).
import numpy as np
from tensorflow import keras
from keras_preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
filename = "D:\\book\\paradise_lost.txt"
# read the file and convert to lowercase
text_1 = open(filename, 'r').read().lower()
print('total number of characters in book: ', len(text_1))
# create mapping of unique chars to integers and reverse
tokenizer.fit_on_texts(text_1)
char_index = tokenizer.word_index
print('Found %s unique characters. ' % len(char_index))
print('char to integer dictionary: ',char_index)
# reverse mapping: integer back to character
index_char = {index: char for char, index in char_index.items()}
print('integer to char dictionary: ',index_char)
Below is the output given by the above code. It is worth noting that there are 48 unique characters in the book, including the newline character and some other special characters.
Output
total number of characters in book: 460069
Found 48 unique characters.
char to integer dictionary: {' ': 1, 'e': 2, 't': 3, 'o': 4, 'a': 5, 'n': 6, 'h': 7, 's': 8, 'i': 9, 'r': 10, 'd': 11, 'l': 12, 'u': 13, '\n': 14, ',': 15, 'm': 16, 'w': 17, 'f': 18, 'c': 19, 'g': 20, 'p': 21, 'b': 22, 'y': 23, 'v': 24, ';': 25, 'k': 26, '.': 27, ':': 28, '-': 29, "'": 30, 'x': 31, 'j': 32, '?': 33, '!': 34, 'q': 35, 'z': 36, '"': 37, '(': 38, ')': 39, '0': 40, '1': 41, '9': 42, '6': 43, '8': 44, '4': 45, '5': 46, '7': 47, '3': 48}
integer to char dictionary: {1: ' ', 2: 'e', 3: 't', 4: 'o', 5: 'a', 6: 'n', 7: 'h', 8: 's', 9: 'i', 10: 'r', 11: 'd', 12: 'l', 13: 'u', 14: '\n', 15: ',', 16: 'm', 17: 'w', 18: 'f', 19: 'c', 20: 'g', 21: 'p', 22: 'b', 23: 'y', 24: 'v', 25: ';', 26: 'k', 27: '.', 28: ':', 29: '-', 30: "'", 31: 'x', 32: 'j', 33: '?', 34: '!', 35: 'q', 36: 'z', 37: '"', 38: '(', 39: ')', 40: '0', 41: '1', 42: '9', 43: '6', 44: '8', 45: '4', 46: '5', 47: '7', 48: '3'}
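As an aside, the same Tokenizer class can also perform the word-level encoding mentioned above; simply leave char_level at its default of False. The small standalone example below (using only the poem's opening line, not the full book) shows the difference:
from keras_preprocessing.text import Tokenizer
line = "of man's first disobedience, and the fruit"
# word-level encoding: every distinct word is assigned an integer
word_tokenizer = Tokenizer(char_level=False)
word_tokenizer.fit_on_texts([line])
print(word_tokenizer.word_index)
# e.g. {'of': 1, "man's": 2, 'first': 3, 'disobedience': 4, 'and': 5, 'the': 6, 'fruit': 7}
# character-level encoding (used in this guide): every distinct character is assigned an integer
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts([line])
print(len(char_tokenizer.word_index), 'unique characters')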
Creating Input Tensor and Output Vectors
The next step is to prepare the data as input and output sets. We are going to take sequences of length five (seq_length = 5), each of which is then used to predict the character that immediately follows it.
char_len = len(text_1)
seq_length = 5
data_X = []
data_y = []
for i in range(0, char_len - seq_length, 1):
    input_seq = text_1[i:i + seq_length]
    output_seq = text_1[i + seq_length]
    data_X.append([char_index[char] for char in input_seq])
    data_y.append(char_index[output_seq])
n_patterns = len(data_X)
print("Total Patterns: ", n_patterns)
print('###########print first 10 elements of list data_X###########')
print(data_X[:10])
print('###########print first 10 elements of list data_y###########')
print(data_y[:10])
Output
###########print first 10 elements of list data_X###########
[[21, 5, 10, 5, 11], [5, 10, 5, 11, 9], [10, 5, 11, 9, 8], [5, 11, 9, 8, 2], [11, 9, 8, 2, 1], [9, 8, 2, 1, 12], [8, 2, 1, 12, 4], [2, 1, 12, 4, 8], [1, 12, 4, 8, 3], [12, 4, 8, 3, 1]]
###########print first 10 elements of list data_y###########
[9, 8, 2, 1, 12, 4, 8, 3, 1, 22]
Once we are ready with the input and output lists, we need to create NumPy arrays of the required shape to feed into the model. After reshaping, the input array is normalized by dividing each element by the number of unique characters in the book (48 in our example). The y vector is also one-hot encoded; a small illustration of one-hot encoding follows the output below.
X = np.reshape(data_X, (n_patterns, seq_length, 1))
X = X / len(char_index)
print("###########first 3 elements of X###########")
print(X[:3])
y = keras.utils.to_categorical(data_y)
print("###########first 3 elements of y###########")
print(y[:3])
Output
###########first 3 elements of X###########
[[[0.4375 ]
[0.10416667]
[0.20833333]
[0.10416667]
[0.22916667]]
[[0.10416667]
[0.20833333]
[0.10416667]
[0.22916667]
[0.1875 ]]
[[0.20833333]
[0.10416667]
[0.22916667]
[0.1875 ]
[0.16666667]]]
###########first 3 elements of y###########
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.]]
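As a quick illustration of what to_categorical is doing to data_y above, here is a tiny standalone example with made-up labels (not taken from the book):
from tensorflow import keras
# three toy class labels, standing in for character indices
labels = [1, 3, 2]
# each label becomes a vector with a single 1 at the label's position;
# the width is max(labels) + 1 unless num_classes is passed explicitly
one_hot = keras.utils.to_categorical(labels)
print(one_hot)
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]]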
Defining the Model
Once we are ready with our input and output datasets, it is time to start building the recurrent neural network. For this guide, we are going to use a long short-term memory (LSTM) layer. You can also use a gated recurrent unit (GRU) as a substitute for the LSTM; a sketch of that variant follows the model summary below.
model = keras.Sequential([
    keras.layers.LSTM(256, return_sequences=False, input_shape=(X.shape[1], X.shape[2])),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(y.shape[1], activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
print(model.summary())
Output
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 256)               264192
_________________________________________________________________
dropout (Dropout)            (None, 256)               0
_________________________________________________________________
dense (Dense)                (None, 49)                12593
=================================================================
Total params: 276,785
Trainable params: 276,785
Non-trainable params: 0
_________________________________________________________________
None
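As noted above, a gated recurrent unit can be swapped in for the LSTM with a one-line change. The sketch below reuses X, y, and the keras import from the script above; it is a variant offered for experimentation, not something trained in this guide:
# same architecture as above, with the LSTM layer replaced by a GRU layer
model_gru = keras.Sequential([
    keras.layers.GRU(256, return_sequences=False, input_shape=(X.shape[1], X.shape[2])),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(y.shape[1], activation='softmax')
])
model_gru.compile(optimizer='adam', loss='categorical_crossentropy')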
Fitting and Saving the Model
The model is now ready to be fit. Since RNNs are computationally intensive and slow to train in a CPU environment, it is good practice to save the model produced after each epoch, so that the adjusted weights and biases are not lost. Including the loss value in the saved filename makes it easy to pick out the best model obtained so far.
filepath="weights-improvement-{epoch}-{loss:.4f}.hdf5"
checkpoint = keras.callbacks.ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=30, batch_size=128, callbacks=callbacks_list)
The network we are building is set to train for 30 epochs with a batch size of 128. These hyperparameters need to be experimented with and adjusted to get the best possible result with the resources you have in hand. (The sample output below is from a shorter two-epoch run.)
Output
Epoch 1/2
2019-11-26 09:04:32.661058: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
460032/460064 [============================>.] - ETA: 0s - loss: 2.9342
Epoch 00001: loss improved from inf to 2.93420, saving model to weights-improvement-01-2.9342.hdf5
460064/460064 [==============================] - 224s 487us/sample - loss: 2.9342
Epoch 2/2
460032/460064 [============================>.] - ETA: 0s - loss: 2.7909
Epoch 00002: loss improved from 2.93420 to 2.79091, saving model to weights-improvement-02-2.7909.hdf5
460064/460064 [==============================] - 215s 468us/sample - loss: 2.7909
Loading the Model and Text Generation
The following program loads the saved model and then generates one thousand characters of text from a five-character seed sequence taken at a random position near the beginning of the book.
import numpy as np
from tensorflow import keras
from keras_preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
filename = "D:\\book\\paradise_lost.txt"
# read the file and convert to lowercase
text_1 = open(filename, 'r').read().lower()
print('total number of characters in book: ', len(text_1))
# create mapping of unique chars to integers and reverse
tokenizer.fit_on_texts(text_1)
char_index = tokenizer.word_index
print('Found %s unique characters. ' % len(char_index))
print('char to integer dictionary: ',char_index)
# reverse mapping: integer back to character
index_char = {index: char for char, index in char_index.items()}
print('integer to char dictionary: ',index_char)
char_len = len(text_1)
seq_length = 5
data_X = []
data_y = []
for i in range(0, char_len - seq_length, 1):
    input_seq = text_1[i:i + seq_length]
    output_seq = text_1[i + seq_length]
    data_X.append([char_index[char] for char in input_seq])
    data_y.append(char_index[output_seq])
n_patterns = len(data_X)
X = np.reshape(data_X, (n_patterns, seq_length, 1))
X = X / len(char_index)
y = keras.utils.to_categorical(data_y)
model = keras.Sequential([
    keras.layers.LSTM(256, return_sequences=False, input_shape=(X.shape[1], X.shape[2])),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(y.shape[1], activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
# Loading the weights file
model.load_weights("C:\\guides\\weights-improvement-02-2.7909.hdf5")
txt_fl = []
print(len(data_X))
# pick a random starting point for the seed sequence
start = np.random.randint(50, 100)
print(start)
pattern = data_X[start]
print('seed: ', ''.join(index_char[value] for value in pattern))
# generate characters
for i in range(1000):
    # reshape and normalize the current window of character indices
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / len(char_index)
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = index_char[index].rstrip('\n\r')
    txt_fl.append(result)
    # slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:]
print(''.join(txt_fl))
Output
red the saarof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeof the siren of the siahe of heaven,and the siren oi his searen sored the soaeeo
Although the above text looks absurd and unintelligible, the model has still picked up some patterns. It produces spaces and word-like groupings, including some real words, which is an achievement in itself after only two epochs of training. With more training, much more sophisticated patterns can be generated.
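One practical way to improve the output without starting over is to resume training from the checkpoint restored above. The sketch below reuses model, X, y, and the keras import from the loading script (the model already holds the restored weights); the extra epoch count of 10 is just an example:
# attach a fresh checkpoint callback and keep training from the restored weights
checkpoint = keras.callbacks.ModelCheckpoint(
    "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5",
    monitor='loss', verbose=1, save_best_only=True, mode='min')
# train for additional epochs from where the saved model left off
model.fit(X, y, epochs=10, batch_size=128, callbacks=[checkpoint])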
Conclusion
It should be kept in mind that text generation techniques, however advanced, do not understand meaning in the human sense of the word. The machine learning model builds a mathematical model from cues in the available training data and then generates text based on the rules it has inferred during training. It also needs to be understood that RNNs are computationally expensive: the problem you are trying to solve can quickly become intractable on a CPU, so RNNs are best trained on graphics processing units (GPUs).