Building Features from Text Data
Learn how to extract features from raw text for predictive modeling and create TF-IDF and bag-of-words (BOW) feature matrices.
Jul 19, 2019 • 16 Minute Read
Introduction
Text data is different from structured tabular data, and therefore building features from it requires a completely different approach. In this guide, you will learn how to extract features from raw text for predictive modeling. You will also learn how to perform text preprocessing steps and create TF-IDF and bag-of-words (BOW) feature matrices. We will begin by exploring the data.
Data
In this guide, we will be using tweet data about the company 'Apple'. The objective is to create features that can be used for building a sentiment predictor model.
The dataset contains 1181 observations and 3 variables, as described below:
- Tweet: Consists of the Twitter comments by the users. The Twitter data is publicly available.
- Avg: Average sentiment of the tweets (-2 means extremely negative, while +2 means extremely positive). This classification was done using Amazon Mechanical Turk.
- Sentiment: Consists of the sentiment labels - positive, negative, and neutral.
Loading the Required Libraries and Modules
# Import required libraries
import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import warnings
%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Download the stopwords corpus once, if it is not already available
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
Loading the Data and Performing Basic Data Checks
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 1,181 observations of 3 variables. The third line prints the first five observations.
dat = pd.read_csv('datatweets.csv')
print(dat.shape)
dat.head(5)
Output:
(1181, 3)
| | Tweet | Avg | Sentiment |
|--- |--------------------------------------------------- |------ |----------- |
| 0 | iphone 5c is ugly as heck what the freak @appl... | -2.0 | Negative |
| 1 | freak YOU @APPLE | -2.0 | Negative |
| 2 | freak you @apple | -2.0 | Negative |
| 3 | @APPLE YOU RUINED MY LIFE | -2.0 | Negative |
| 4 | @apple I hate apple!!!!! | -2.0 | Negative |
We will start by performing a basic analysis of the data. The line of code below prints the number of tweets per 'Sentiment' label. The output shows that the highest number of tweets carry a negative sentiment, while the lowest carry a positive sentiment.
# Get the number of tweets per sentiment label
dat.groupby('Sentiment')['Tweet'].count()
Output:
Sentiment
Negative 541
Neutral 337
Positive 303
Name: Tweet, dtype: int64
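Since matplotlib and seaborn were already imported, we can also visualize this distribution. The snippet below is a minimal sketch of a count plot of the 'Sentiment' labels; it is an optional exploratory step, not part of the feature-building pipeline.
# Visualize the number of tweets per sentiment label
sns.countplot(x='Sentiment', data=dat)
plt.title('Number of Tweets per Sentiment Label')
plt.show()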
The sentiment score for the tweets is stored in the variable 'Avg', which ranges from -2 (extremely negative) to +2 (extremely positive). We will explore whether there is a difference in the average sentiment scores across the 'Sentiment' labels. The line of code below performs this task, and the output shows that the average negative score is -0.74, while the average positive score is 0.57.
dat.groupby('Sentiment')['Avg'].mean()
Output:
Sentiment
Negative -0.743068
Neutral 0.000000
Positive 0.574257
Name: Avg, dtype: float64
Building Simple Features from Raw Text
Many simple but important features can be extracted from the raw text data, as discussed below.
Character Length
The hypothesis is that the character length of a tweet varies with the sentiment it carries. The first line of code below creates a new variable 'character_cnt' that takes in the text from the 'Tweet' variable and calculates the count of characters in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character length across the labels.
The output shows that the neutral sentiments have a lower character count on average, as compared to the positive and the negative tweets. This inference can be useful for separating the neutral tweets from the other types of tweets.
dat['character_cnt'] = dat['Tweet'].str.len()
dat.groupby('Sentiment')['character_cnt'].mean()
Output:
Sentiment
Negative 91.763401
Neutral 85.379822
Positive 94.825083
Name: character_cnt, dtype: float64
Word Count
Just like the character count in a tweet, the word count can also be a useful feature. The first line of code below creates a new variable 'word_counts' that takes in the text from the 'Tweet' variable and calculates the count of words in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average word count across the labels.
The output shows that the negative sentiments have the highest average word count, suggesting that disappointed customers tend to write longer tweets. This inference can be useful for separating the 'Sentiment' labels.
dat['word_counts'] = dat['Tweet'].str.split().str.len()
dat.groupby('Sentiment')['word_counts'].mean()
Output:
Sentiment
Negative 15.336414
Neutral 12.356083
Positive 14.676568
Name: word_counts, dtype: float64
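As a quick visual check of this difference, the sketch below draws a box plot of the word counts per sentiment label, again using the seaborn library imported earlier; it is purely exploratory.
# Compare the spread of word counts across sentiment labels
sns.boxplot(x='Sentiment', y='word_counts', data=dat)
plt.title('Word Counts by Sentiment')
plt.show()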
Average Character Length per Word
Since we have created the 'character_cnt' and the 'word_counts' features, it is easy to create their ratio, which gives the average number of characters per word in each tweet.
The first line of code below creates a new variable 'characters_per_word' that is the ratio of the number of characters and the number of words in a tweet. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character length per word across the labels.
The output shows that neutral sentiments have the highest average character length per word. This inference can be useful for separating the 'Sentiment' labels.
dat['characters_per_word'] = dat['character_cnt']/dat['word_counts']
dat.groupby('Sentiment')['characters_per_word'].mean()
Output:
Sentiment
Negative 6.191374
Neutral 7.425695
Positive 6.687928
Name: characters_per_word, dtype: float64
Special Character Count
It is also possible to create a feature that contains the count of special characters like '@' or '#'. The first line of code below creates a new feature 'spl' that takes in the text from the 'Tweet' variable and calculates the count of words starting with the special character '@'. We use the startswith() method to perform this operation. The second line prints the first five observations containing the 'Tweet' and the 'spl' variables.
dat['spl'] = dat['Tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('@')]))
dat[['Tweet','spl']].head()
Output:
| | Tweet | spl |
|--- |--------------------------------------------------- |----- |
| 0 | iphone 5c is ugly as heck what the freak @appl... | 2 |
| 1 | freak YOU @APPLE | 1 |
| 2 | freak you @apple | 1 |
| 3 | @APPLE YOU RUINED MY LIFE | 1 |
| 4 | @apple I hate apple!!!!! | 1 |
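The same pattern works for other special characters. For instance, the sketch below counts words starting with '#' (hashtags) in the same way; the column name 'hashtags' is an illustrative choice, not part of the original feature set.
# Count words starting with '#' (hashtags) in each tweet
dat['hashtags'] = dat['Tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
dat[['Tweet', 'hashtags']].head()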
Number Count
Just like we created a feature on the count of words in a tweet, we can also create a feature on the count of numbers in a tweet. The first line of code below creates a new variable 'num' that takes in the text from the 'Tweet' variable and calculates the count of the numbers in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average count of numbers across the labels.
The output shows that the neutral sentiment labels have the lowest average count of numbers in a tweet, whereas the negative tweets have the highest average.
# Count of numeric tokens in each tweet
dat['num'] = dat['Tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
dat.groupby('Sentiment')['num'].mean()
Output:
Sentiment
Negative 0.125693
Neutral 0.068249
Positive 0.108911
Name: num, dtype: float64
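Another simple feature in the same spirit is the count of fully upper-case words, which can signal 'shouting' in a tweet (as in the second observation above). The sketch below is one possible way to build it; the column name 'upper' is an illustrative choice, not part of the original guide.
# Count fully upper-case words (potential 'shouting') in each tweet
dat['upper'] = dat['Tweet'].apply(lambda x: len([w for w in x.split() if w.isupper() and len(w) > 1]))
dat.groupby('Sentiment')['upper'].mean()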
Pre-processing the Raw Text
So far, we have created simple features from the raw text. We can also create more advanced features but, before that, we will have to clean the text. The common pre-processing steps are summarized below:
- Removing punctuation: the rule of thumb is to remove every character that is not a word character or whitespace. The first line of code below performs this task.
- Removing stopwords: these are common words like 'the', 'is', and 'at'. They appear frequently in the corpus but do not help in differentiating the target classes, and removing them also reduces the data size. The second line of code below performs this task.
- Conversion to lowercase: words like 'Phone' and 'phone' need to be treated as the same word, so all text is converted to lowercase. The third line of code below performs this task.
- Stemming: the goal of stemming is to reduce the number of inflectional forms of words appearing in the text. It reduces words such as "argue", "argued", "arguing", and "argues" to their common stem "argu". There are many ways to perform stemming, a popular one being the "Porter Stemmer" method by Martin Porter. The fourth to sixth lines of code below perform this task.

The last line of code prints a summary of all the new features that we have built so far.
# Remove punctuation (anything that is not a word character or whitespace)
dat['processedtext'] = dat['Tweet'].str.replace(r'[^\w\s]', '', regex=True)
# Remove stopwords
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# Convert to lowercase
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# Lines 4 to 6: stem each word with the Porter Stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))
dat[['character_cnt','word_counts','characters_per_word', 'spl', 'num', 'processedtext']].head()
Output:
| | character_cnt | word_counts | characters_per_word | spl | num | processedtext |
|--- |--------------- |------------- |--------------------- |----- |----- |--------------------------------------------- |
| 0 | 64 | 11 | 5.818182 | 2 | 0 | iphon 5c ugli heck freak appl iphonecompani |
| 1 | 16 | 3 | 5.333333 | 1 | 0 | freak you appl |
| 2 | 16 | 3 | 5.333333 | 1 | 0 | freak appl |
| 3 | 25 | 5 | 5.000000 | 1 | 0 | appl you ruin my life |
| 4 | 24 | 4 | 6.000000 | 1 | 0 | appl i hate appl |
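Before vectorizing, it can be useful to check which stems dominate the cleaned corpus. The sketch below is a quick exploratory count of the ten most frequent stems in 'processedtext'; it is not a step from the original pipeline.
# Inspect the ten most frequent stems in the processed tweets
pd.Series(" ".join(dat['processedtext']).split()).value_counts().head(10)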
Term Frequency-Inverse Document Frequency (TF-IDF) Vector
We have cleaned the text, which is now stored in the new variable 'processedtext'. However, in order to use it for building machine learning models, we will have to convert it to word frequency vectors.
One of the most popular methods to do this is through the TF-IDF representation, which is used as a weighting factor in text mining applications. In simple terms, TF-IDF attempts to highlight important words which appear frequently in a document but not across documents. The terms are briefly explained below:
- Term Frequency (TF): This summarizes how frequently a given term appears within a document (the normalized term frequency).
- Inverse Document Frequency (IDF): This reduces the weight of terms that appear in many documents across the corpus (a small worked example follows this list).
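To make the weighting concrete, the sketch below computes the TF-IDF value of a single term by hand, using the smoothed IDF formula that scikit-learn's TfidfVectorizer applies by default (idf = ln((1 + n) / (1 + df)) + 1); the counts used here are purely illustrative.
# Illustrative counts (not taken from the tweet data)
n_documents = 1181     # total number of documents in the corpus
df_term = 300          # number of documents containing the term
tf_term = 2            # raw count of the term in one document

# Smoothed IDF, as used by TfidfVectorizer by default
idf = np.log((1 + n_documents) / (1 + df_term)) + 1
tf_idf = tf_term * idf   # each row is then L2-normalized by the vectorizer
print(round(tf_idf, 3))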
Now, we will work on creating the TF-IDF vectors for our tweets. The first line of code below imports 'TfidfVectorizer' from the sklearn.feature_extraction.text module. The second line initializes the TfidfVectorizer object, called 'tfidf', while the third line fits and transforms the variable 'processedtext' from the data.
The important arguments we have used when initializing the TfidfVectorizer object are 'max_features' and 'ngram_range'. While the 'max_features' argument specifies the maximum number of features to be created, the argument 'ngram_range=(1,1)' specifies that only unigrams will be considered for feature creation.
The fourth line prints a summary of the object, which is a sparse matrix containing the number of observations (1181) and the number of features (500).
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=500, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1,1))
dat_tfIdf = tfidf.fit_transform(dat['processedtext'])
dat_tfIdf
Output:
<1181x500 sparse matrix of type '<class 'numpy.float64'>'
with 6473 stored elements in Compressed Sparse Row format>
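If you want to inspect individual TF-IDF features, the sparse matrix can be converted into a dense DataFrame with one column per term. The sketch below uses tfidf.get_feature_names(); note that newer scikit-learn versions expose this as get_feature_names_out() instead.
# View the TF-IDF matrix as a DataFrame (one column per term)
tfidf_df = pd.DataFrame(dat_tfIdf.toarray(), columns=tfidf.get_feature_names())
tfidf_df.head()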
Bag-of-words Vector
Another popular technique for creating word vectors is the bag-of-words approach. It is a simple method for identifying topics in a document, and it works on the assumption that the more frequently a term appears, the more important it is.
The first line of code below imports the 'CountVectorizer' utility from the 'sklearn.feature_extraction.text' module. The second line initializes the CountVectorizer object, called 'bag_words', while the third line fits and transforms the variable 'processedtext' from the data. The fourth line prints a summary of the object, which is, again, a sparse matrix containing the number of observations (1181) and the number of features (500).
from sklearn.feature_extraction.text import CountVectorizer
bag_words = CountVectorizer(max_features=500, lowercase=True, analyzer='word', ngram_range=(1,1))
dat_BOW = bag_words.fit_transform(dat['processedtext'])
dat_BOW
Output:
<1181x500 sparse matrix of type '<class 'numpy.int64'>'
with 7181 stored elements in Compressed Sparse Row format>
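As with the TF-IDF matrix, the BOW counts can be summarized to find the most frequent terms across the corpus. The sketch below sums each column of the count matrix; the same get_feature_names() / get_feature_names_out() caveat noted above applies here.
# Total count of each term across all tweets, sorted in descending order
term_counts = np.asarray(dat_BOW.sum(axis=0)).ravel()
pd.Series(term_counts, index=bag_words.get_feature_names()).sort_values(ascending=False).head(10)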
Conclusion
In this guide, you have learned the fundamentals of building features from both raw and processed text data. You can now use the basic as well as the more advanced features to build a machine learning model that predicts the sentiment of a tweet.
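As a next step, either matrix can be fed directly into a classifier. The sketch below trains a logistic regression model on the TF-IDF features to predict the 'Sentiment' label; it is a minimal illustration with default settings, not a tuned model from this guide.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the TF-IDF features and the sentiment labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(dat_tfIdf, dat['Sentiment'], test_size=0.3, random_state=1)

# Fit a simple logistic regression classifier and check its accuracy on the test set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))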
To learn more about Natural Language Processing and Text Analytics, please refer to the following guides: