Natural Language Processing – Text Parsing
Apr 23, 2019 • 14 Minute Read
Introduction
Natural Language Processing (NLP) has gained a lot of traction as a sub-field of Artificial Intelligence. It is focused on enabling computers to understand and process human languages. Some common applications include Chatbots, Sentiment Analysis, Translation, Spam Classification, and many more.
However, there is a significant difference between NLP and traditional machine learning tasks: the former deals with unstructured text data, while the latter typically works with structured tabular data. Hence, it is necessary to understand how to handle text before applying machine learning techniques to it. This is where text parsing comes into the picture.
So, what is text parsing? In simple terms, it is a common programming task that separates the given series of text into smaller components based on some rules. Its application ranges from document parsing to deep learning NLP.
In this guide, we will be applying the rich functionalities available within Python to do text parsing. The two popular options are regular expressions and word tokenization.
Regular Expressions
Regular Expressions, or Regex, are strings with a special syntax that allow us to match patterns in other strings. In Python, the re module is used to work with regular expressions. Some of the common regex patterns and their usage are shown below, followed by a brief demonstration.
- '\d': matches any decimal digit; for example, 5. A variant of this is '\D', which matches any non-digit character.
- '\s': matches any whitespace character; for example, ' ' (a space). A variant of this is '\S', which matches any non-whitespace character.
- '\w': matches any alphanumeric character; for example, 'Pluralsight'. A variant of this is '\W', which matches any non-alphanumeric character.
- '+' and '*': quantifiers that perform a greedy match; '+' matches one or more occurrences of the preceding pattern, while '*' matches zero or more. For example, 'e+' matches 'eeeeee'.
- '[a-z]': matches any lowercase English letter.
- '[A-Za-z]': matches any uppercase or lowercase English letter.
- '[0-9]': matches any digit from 0 to 9.
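As a quick illustration, the short sketch below (using a made-up sample string) shows how a few of these patterns behave with re.findall(), which is covered in more detail later in this guide:
import re
sample = "NLP with Python 3 is fun!"
print(re.findall(r'\d', sample))         # digits -> ['3']
print(re.findall(r'\s', sample))         # whitespace characters -> [' ', ' ', ' ', ' ', ' ']
print(re.findall(r'\w+', sample))        # alphanumeric words -> ['NLP', 'with', 'Python', '3', 'is', 'fun']
print(re.findall(r'[A-Za-z]+', sample))  # alphabetic words only -> ['NLP', 'with', 'Python', 'is', 'fun']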
These are a few of the many regex patterns available. We will understand regex better in the subsequent sections of this guide with the help of examples. We will start by importing the re module, which is done in the first line of code below. We will also need a text object or corpus, for which we use a brief description of the popular movie 'Avengers: Infinity War'. We store this text in the variable 'regex_example'. This is done in the second line of code.
import re
regex_example = "Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)"
print(regex_example)
Output: Avengers: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the Avengers. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)
Common Python Regex Methods
re.findall()
The re.findall() method returns a list of all non-overlapping matches of the pattern in the string. In case of no match, an empty list is returned.
The first line of code below extracts numbers from the text, 'regex_example', that we created earlier.
We will now work with characters and count the number of vowels in the text. The second and third lines of code perform this task; there are 87 occurrences of vowels in the text.
Suppose you want to find the number of times the word 'Avengers' was used in the corpus. This is achieved by the fourth line of code, and the answer is 2.
We can also find all capitalized words and print the result, which is done in the fifth and sixth lines of code. The output only contains the capitalized words of the corpus.
print(re.findall(r'\d+', regex_example))  # line 1
vowels = re.findall(r'[aeiou]', regex_example)  # line 2
print(len(vowels))  # line 3
print(len(re.findall(r'Avengers', regex_example)))  # line 4
capitalwords = r"[A-Z]\w+"  # line 5
print(re.findall(capitalwords, regex_example))  # line 6
Output:
['2018', '19', '149', '2']
87
2
['Avengers', 'Infinity', 'War', 'American', 'Marvel', 'Comics', 'Avengers', 'It', 'Marvel', 'Cinematic', 'Universe', 'MCU', 'The', 'Source', 'Wikipedia']
re.split()
Another useful method is re.split(), which splits the string wherever the pattern matches. In case of no match, it returns a list containing the original string.
In our example, let's apply this method and split the corpus with a pattern of numbers. The below chunk of code does this task and prints the output.
print(re.split(r"\d+", regex_example))
Output:
['Avengers: Infinity War was a ', ' American superhero film based on the Marvel Comics superhero team the Avengers. It is the ', 'th film in the Marvel Cinematic Universe (MCU). The running time of the movie was ', ' minutes and the box office collection was around ', ' billion dollars. (Source: Wikipedia)']
The above output shows that the split was done on the digits. It is possible to add the 'maxsplit' argument to the re.split() method, which indicates the maximum number of splits that will occur. The default value is zero, which means all possible splits are made. The code below sets maxsplit to 2, so the method only splits at the first two digit occurrences.
print(re.split(r"\d+", regex_example, maxsplit=2))
Output:
['Avengers: Infinity War was a ', ' American superhero film based on the Marvel Comics superhero team the Avengers. It is the ', 'th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)']
Another application would be to split the corpus on spaces. That is achieved with the below code.
print(re.split(r"\s+", regex_example))
Output: ['Avengers:', 'Infinity', 'War', 'was', 'a', '2018', 'American', 'superhero', 'film', 'based', 'on', 'the', 'Marvel', 'Comics', 'superhero', 'team', 'the', 'Avengers.', 'It', 'is', 'the', '19th', 'film', 'in', 'the', 'Marvel', 'Cinematic', 'Universe', '(MCU).', 'The', 'running', 'time', 'of', 'the', 'movie', 'was', '149', 'minutes', 'and', 'the', 'box', 'office', 'collection', 'was', 'around', '2', 'billion', 'dollars.', '(Source:', 'Wikipedia)']
re.sub()
This method substitutes all occurrences of the matched pattern with the replacement string. If the pattern is not found, the original string is returned.
In our example, let's substitute the word 'Avengers' with 'A'. The below chunk of code does this task.
print(re.sub("Avengers", "A", regex_example))
Output:
A: Infinity War was a 2018 American superhero film based on the Marvel Comics superhero team the A. It is the 19th film in the Marvel Cinematic Universe (MCU). The running time of the movie was 149 minutes and the box office collection was around 2 billion dollars. (Source: Wikipedia)
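re.sub() also accepts an optional count argument that limits the number of replacements; the following minimal sketch reuses the same corpus and, with count=1, replaces only the first occurrence of 'Avengers'.
print(re.sub("Avengers", "A", regex_example, count=1))  # only the first match is replaced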
re.search()
This method searches for the pattern in the string and returns a match object if successful. If the search fails, it returns None.
Let's understand this in the example below, where we will try to find if the word 'Python' is at the beginning of the sentence "Scikit Learn is a great Python library". In our example, since the search fails, the output is 'None'.
example = "Scikit Learn is a great Python library"
match = re.search(r'\APython', example)
print(match)
Output: None
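For contrast, here is a minimal sketch of a successful search on the same sentence. The returned match object exposes methods such as group() and start() to inspect what was matched and where.
match = re.search(r'Python', example)
print(match.group())  # 'Python'
print(match.start())  # starting index of the match, 24 in this sentence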
Regular expressions are a vast topic, and it is impossible to cover all of it in one guide. However, we now have a basic understanding of the most commonly used regex methods and how they work in Python, which will be useful in the majority of cases. Let's now turn to another important text parsing technique called word tokenization.
Word Tokenization
Tokenization is the process of breaking a text or corpus into tokens (smaller pieces). The conversion into tokens is carried out based on certain rules; custom rules can also be created using regex, as shown later in this section. Tokenization helps in text pre-processing tasks such as mapping parts of speech, finding and matching common words, cleaning text, and getting the data ready for advanced text analytics techniques like sentiment analysis.
Python has a very popular natural language toolkit library called 'nltk', which has a rich set of functions for performing many NLP tasks. It can be installed through pip and is also included in the Anaconda distribution.
We will be working on the common applications of word tokenization using nltk. To start with, we should import the nltk library. Then, we import the sent_tokenize and word_tokenize functions. This is done in the code below.
# Import necessary modules
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
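Note that these tokenizers rely on pre-trained models that are not bundled with the library itself. On a fresh nltk installation, you may first need to download the 'punkt' resource (the exact resource name can vary slightly between nltk versions):
nltk.download('punkt')  # downloads the sentence/word tokenizer models, if not already present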
Now, we will get some text. We will be using the following text for our examples:
textdata = "Pluralsight is the technology skills platform. It has more than 6000 courses and 1100+ employees. Pluralsight is headquartered in Utah and has over 1500 experts authoring high quality courses. Pluralsight grew 161 percent between 2014 and 2017, earning the company a spot in the Deloitte Technology Fast 500 list for five consecutive years."
print(textdata)
Output:
Pluralsight is the technology skills platform. It has more than 6000 courses and 1100+ employees. Pluralsight is headquartered in Utah and has over 1500 experts authoring high quality courses. Pluralsight grew 161 percent between 2014 and 2017, earning the company a spot in the Deloitte Technology Fast 500 list for five consecutive years.
We will now perform word tokenization tasks on this text data. The most common forms of tokenization are word and sentence tokenization.
The first code snippet below tokenizes the text into words and prints the output. The second snippet also performs tokenization, but this time on sentences.
print(word_tokenize(textdata))
Output:
['Pluralsight', 'is', 'the', 'technology', 'skills', 'platform', '.', 'It', 'has', 'more', 'than', '6000', 'courses', 'and', '1100+', 'employees', '.', 'Pluralsight', 'is', 'headquartered', 'in', 'Utah', 'and', 'has', 'over', '1500', 'experts', 'authoring', 'high', 'quality', 'courses', '.', 'Pluralsight', 'grew', '161', 'percent', 'between', '2014', 'and', '2017', ',', 'earning', 'the', 'company', 'a', 'spot', 'in', 'the', 'Deloitte', 'Technology', 'Fast', '500', 'list', 'for', 'five', 'consecutive', 'years', '.']
print(sent_tokenize(textdata))
Output:
['Pluralsight is the technology skills platform.', 'It has more than 6000 courses and 1100+ employees.', 'Pluralsight is headquartered in Utah and has over 1500 experts authoring high quality courses.', 'Pluralsight grew 161 percent between 2014 and 2017, earning the company a spot in the Deloitte Technology Fast 500 list for five consecutive years.']
It's important to understand the difference between the two functions 'word_tokenize' and 'sent_tokenize'. In the above outputs, we saw that both the functions create different outputs (tokens) for the same input (text data), and that's because of the difference in their functionalities.
We will now look at the unique tokens generated with both methods in the two lines of code below. In the first output below, the term 'Pluralsight' appears only once because it is a single word-level token. However, in the second output, the term appears multiple times because the tokenization happened at the sentence level, and the word occurs in several sentences.
print(set(word_tokenize(textdata)))
Output:
{'technology', 'headquartered', '.', 'quality', '161', 'five', '2017', 'It', 'spot', 'over', 'Technology', 'skills', 'percent', 'years', '500', 'platform', 'consecutive', 'has', '1500', 'earning', ',', 'the', 'Deloitte', '1100+', 'grew', '6000', 'in', 'between', 'than', 'courses', 'Pluralsight', '2014', 'authoring', 'Utah', 'a', 'list', 'is', 'and', 'Fast', 'more', 'for', 'experts', 'high', 'employees', 'company'}
print(set(sent_tokenize(textdata)))
Output:
{'Pluralsight grew 161 percent between 2014 and 2017, earning the company a spot in the Deloitte Technology Fast 500 list for five consecutive years.', 'Pluralsight is the technology skills platform.', 'It has more than 6000 courses and 1100+ employees.', 'Pluralsight is headquartered in Utah and has over 1500 experts authoring high quality courses.'}
Let's calculate the number of unique tokens generated using both the functions, 'word_tokenize' and 'sent_tokenize', which is done in the two lines of code below. There are 45 unique 'word' tokens, while there are only 4 unique 'sentence' tokens.
print(len(set(word_tokenize(textdata))))
print(len(set(sent_tokenize(textdata))))
Output:
45
4
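As mentioned at the start of this section, custom tokenization rules can also be defined with regex. A minimal sketch using nltk's regexp_tokenize is shown below; the pattern here is an illustrative choice that keeps only runs of alphanumeric characters, so punctuation is dropped from the tokens.
from nltk.tokenize import regexp_tokenize
print(regexp_tokenize(textdata, r"\w+"))  # tokens are runs of word characters; punctuation is discarded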
We have a fair understanding of tokenization. We will now add some visualization to tokenization through the powerful Python library matplotlib. The first line of code imports the matplotlib library, while the second line creates word tokens and stores them in the object 't'. The third line calculates the length of each token. The fourth and fifth lines of code plot a histogram of these token lengths. We can see that most of the tokens are short, with only a few longer words.
# Combining NLP data extraction with plotting
from matplotlib import pyplot as plt  # line 1
t = word_tokenize(textdata)  # line 2: create word tokens
wordlen = [len(w) for w in t]  # line 3: length of each token
plt.hist(wordlen)  # line 4: histogram of token lengths
plt.show()  # line 5: display the plot
Output: (a histogram showing the distribution of token lengths)
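The same token list can also be summarized numerically. As an optional sketch, nltk's FreqDist class counts how often each token occurs, which makes it easy to see which tokens are repeated most in the corpus.
from nltk import FreqDist
freq = FreqDist(t)  # counts occurrences of each token in the token list 't'
print(freq.most_common(5))  # the five most frequent tokens and their counts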
Conclusion
In this guide, we have come a long way, from understanding basic regular expressions to tokenizing words and sentences. We learned the usage of Python's two powerful libraries, re and nltk, using interesting text examples. We also learned how to create visualizations of word tokens using nltk and matplotlib.
Both regex and nltk can play a vital role in the text pre-processing phase. However, the domain of these techniques is too big to be covered in a single guide, which is intended to be a building block for anyone aspiring to start working on natural language processing problems using Python libraries such as re and nltk.