Named Entity Recognition (NER)
Jul 10, 2019 • 9 Minute Read
Introduction
In this guide, you will learn about an advanced Natural Language Processing technique called Named Entity Recognition, or 'NER'.
NER is an NLP task used to identify important named entities in text, such as people, places, organizations, dates, or any other category of interest. It can be used alone or alongside topic identification, and it adds a lot of semantic knowledge to the content, helping us understand the subject of a given text.
Let us start by loading the required libraries and modules.
Loading the Required Libraries and Modules
import nltk
from collections import Counter
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the required NLTK resources (only needed the first time)
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
We will be using the following text for this guide:
textexample = "Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia)."
print(textexample)
Output:
Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia).
Word Tokenization
The first step is to split the text into sentences, which the first line of code below does. The second line performs word tokenization on each sentence, and the third line displays the tokenized sentences.
sentences = nltk.sent_tokenize(textexample)
tokenized_sentence = [nltk.word_tokenize(sent) for sent in sentences]
tokenized_sentence
Output:
['Avengers'
Parts of Speech (POS) Tagging
Parts-of-speech tagging, also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context. The line of code below passes each tokenized sentence to the 'nltk.pos_tag' function to produce its POS tags.
pos_tagging_sentences = [nltk.pos_tag(sent) for sent in tokenized_sentence]
Let us combine these two steps into a function and analyse the output. The first to fourth lines of code below create a function that tokenizes the text and performs POS tagging. The fifth line runs the function on our text, while the sixth line displays the output.
def preprocess(text):
    text = nltk.word_tokenize(text)
    text = nltk.pos_tag(text)
    return text
processed_text = preprocess(textexample)
processed_text
Output:
('Avengers'
The output above shows that every token has been tagged with its part of speech. Some of the common abbreviations are explained below:
- DT: determiner
- IN: preposition/subordinating conjunction
- JJ: adjective ‘big’
- JJR: adjective, comparative ‘bigger’
- JJS: adjective, superlative ‘biggest’
- LS: list marker
- NN: noun, singular ‘desk’
- NNS: noun plural ‘desks’
- NNP: proper noun, singular ‘Harrison’
- NNPS: proper noun, plural ‘Americans’
- PRP: personal pronoun ‘I’, ‘he’, ‘she’
- RB: adverb ‘very’, ‘silently’
- UH: interjection
- VB: verb, base form ‘take’
- VBD: verb, past tense ‘took’
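To get a feel for these tags, it can help to count how often each one occurs. The following is a minimal sketch using only the standard library; the sample pairs are hand-written in the same (token, tag) shape that 'nltk.pos_tag' returns, and the names `TAG_DESCRIPTIONS`, `summarize_tags`, and `describe` are illustrative helpers, not part of NLTK.

```python
from collections import Counter

# A few Penn Treebank tags and their meanings, taken from the list above.
TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "IN": "preposition/subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
}


def summarize_tags(tagged_tokens):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _token, tag in tagged_tokens)


def describe(tag):
    """Return a readable description, or a fallback if the tag is not in the table."""
    return TAG_DESCRIPTIONS.get(tag, "not in this lookup table")


# Hand-written sample (an assumption for illustration, not real tagger output).
sample = [
    ("Endgame", "NN"),
    ("film", "NN"),
    ("Marvel", "NNP"),
    ("Comics", "NNP"),
    ("Chris", "NNP"),
    ("others", "NNS"),
]

tag_counts = summarize_tags(sample)
for tag, count in tag_counts.most_common():
    print(f"{tag:5} {count}  {describe(tag)}")
```

In practice, the same function could be called on the result of 'nltk.pos_tag' instead of the hand-written sample.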
Chunking
Once we have completed the parts-of-speech tagging, we will perform chunking. In simple terms, chunking adds more structure to the sentence on top of the tagging by grouping words into units called 'chunks'.
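To make the idea of a chunk concrete, here is a small sketch that groups runs of consecutive proper nouns into one unit. It is a naive heuristic for illustration only and is not how NLTK's 'ne_chunk' works (that relies on a trained model); the function name `group_proper_nouns` and the sample tags are assumptions made for this example.

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Group runs of consecutive proper nouns (NNP/NNPS) into chunks.

    Naive illustration of chunking; not the algorithm used by nltk.ne_chunk.
    """
    chunks = []
    is_proper_noun = lambda pair: pair[1] in ("NNP", "NNPS")
    for is_proper, run in groupby(tagged_tokens, key=is_proper_noun):
        if is_proper:
            chunks.append(" ".join(token for token, _tag in run))
    return chunks


# Hand-written (token, tag) pairs, assumed for illustration.
sample = [
    ("produced", "VBN"),
    ("by", "IN"),
    ("Marvel", "NNP"),
    ("Studios", "NNP"),
    ("and", "CC"),
    ("distributed", "VBN"),
    ("by", "IN"),
    ("Walt", "NNP"),
    ("Disney", "NNP"),
    ("Studios", "NNP"),
]

proper_noun_chunks = group_proper_nouns(sample)
print(proper_noun_chunks)
```

The tagged tokens stay the same; chunking only adds a layer that says which neighbouring tokens belong together.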
The first line of code below performs chunking on the processed text. The second to fourth lines print the result, keeping only the lines that contain a noun tag, since in our example we will only look at nouns for the NER tagging.
res_chunk = ne_chunk(processed_text)
for x in str(res_chunk).split('\n'):
    if '/NN' in x:
        print(x)
Output:
Avengers/NNS
Endgame/NN
superhero/NN
film/NN
(ORGANIZATION Marvel/NNP Comics/NNP)
superhero/NN
team/NN
(ORGANIZATION Avengers/NNPS)
(PERSON Marvel/NNP Studios/NNP)
(PERSON Walt/NNP Disney/NNP Studios/NNP)
Motion/NNP
Pictures/NNP
movie/NN
cast/NN
(PERSON Robert/NNP Downey/NNP Jr./NNP)
(PERSON Chris/NNP Evans/NNP)
(PERSON Mark/NNP Ruffalo/NNP)
(PERSON Chris/NNP Hemsworth/NNP)
others/NNS
(PERSON Source/NN)
wikipedia/NN
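Let us explore the above output. We observe that the word tokens 'Endgame', 'film', and 'Source' are tagged as singular noun 'NN', while tokens like 'Avengers' and 'others' are tagged as plural noun 'NNS'. Also, note that the names of the actors 'Robert', 'Evans', etc., have been tagged as proper noun 'NNP'. The chunker additionally groups some tokens into labelled entities such as 'PERSON' and 'ORGANIZATION', although not always correctly: 'Marvel Studios' and 'Source', for example, are labelled 'PERSON'.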
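If you want the entities as data rather than printed lines, one option is to parse the labelled lines. The sketch below works on the string form of the chunk output and uses only the standard library; the pattern `CHUNK_LINE` and the helper `parse_chunk_line` are illustrative names, and the approach assumes each labelled chunk sits on a single line, as in the output above.

```python
import re

# Matches one labelled chunk line such as "(PERSON Chris/NNP Evans/NNP)".
CHUNK_LINE = re.compile(r"^\((?P<label>[A-Z_]+) (?P<body>.+)\)$")


def parse_chunk_line(line):
    """Return (label, entity_text) for a labelled chunk line, else None."""
    match = CHUNK_LINE.match(line.strip())
    if match is None:
        return None
    # Each item looks like "token/TAG"; keep the part before the last slash.
    tokens = [item.rsplit("/", 1)[0] for item in match.group("body").split()]
    return match.group("label"), " ".join(tokens)


# Lines copied from the output above.
lines = [
    "film/NN",
    "(ORGANIZATION Marvel/NNP Comics/NNP)",
    "(PERSON Robert/NNP Downey/NNP Jr./NNP)",
    "(PERSON Chris/NNP Evans/NNP)",
    "others/NNS",
]

entities = [parsed for parsed in map(parse_chunk_line, lines) if parsed is not None]
for label, text in entities:
    print(f"{label}: {text}")
```

The object returned by 'ne_chunk' is an NLTK tree whose subtrees expose 'label()' and 'leaves()', so walking the tree directly is likely the more robust route; the string-based version is shown here only because it mirrors the printing loop used in this guide.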
Conclusion
In this guide, you have learned how to perform Named Entity Recognition using NLTK. You learned about the three important stages of Word Tokenization, POS Tagging, and Chunking that are needed to perform NER analysis.
To learn more about Natural Language Processing with Python, please refer to the following guides: