Building a Twitter Sentiment Analysis in Python
Jul 1, 2020 • 10 Minute Read
Introduction
The ability to categorize opinions expressed in the text of tweets—and especially to determine whether the writer's attitude is positive, negative, or neutral—is highly valuable. In this guide, we will use the process known as sentiment analysis to categorize the opinions of people on Twitter towards a hypothetical topic called #hashtag.
There are different ordinal scales used to categorize tweets. A five-point ordinal scale includes five categories: Highly Negative, Slightly Negative, Neutral, Slightly Positive, and Highly Positive. A three-point ordinal scale includes Negative, Neutral, and Positive; and a two-point ordinal scale includes Negative and Positive. In this guide, we will use a three-point ordinal scale to categorize tweets with #hashtag.
Getting Started
Sentiment analysis involves natural language processing because it deals with human-written text. You'll have to download a few Python libraries to work with the code. Use pip install <library> to install them.
Setting Up
To train a machine learning model, we need data. You can download the dataset to use in this guide here.
Importing the required libraries.
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# ML Libraries
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Global Parameters
stop_words = set(stopwords.words('english'))
Loading the Dataset
After you download the CSV, you'll see that there are 1.6 million tweets already coded into three categories by hand.
This dataset encoded the target variable with a 3-point ordinal scale: 0 = negative, 2 = neutral, 4 = positive.
def load_dataset(filename, cols):
dataset = pd.read_csv(filename, encoding='latin-1')
dataset.columns = cols
return dataset
The dataset has six columns ['target', 't_id', 'created_at', 'query', 'user', 'text'], but we are only interested in ['text', 'target']. You can include other columns also if you like. To make it scalable, you need a small script.
def remove_unwanted_cols(dataset, cols):
for col in cols:
del dataset[col]
return dataset
Pre-processing Tweets
This is one of the essential steps in any natural language processing (NLP) task. Data scientists never get filtered, ready-to-use data. To make it workable, there is a lot of processing that needs to happen.
-
Letter casing: Converting all letters to either upper case or lower case.
-
Tokenizing: Turning the tweets into tokens. Tokens are words separated by spaces in a text.
-
Noise removal: Eliminating unwanted characters, such as HTML tags, punctuation marks, special characters, white spaces etc.
-
Stopword removal: Some words do not contribute much to the machine learning model, so it's good to remove them. A list of stopwords can be defined by the nltk library, or it can be business-specific.
-
Normalization: Normalization generally refers to a series of related tasks meant to put all text on the same level. Converting text to lower case, removing special characters, and removing stopwords will remove basic inconsistencies. Normalization improves text matching.
-
Stemming: Eliminating affixes (circumfixes, suffixes, prefixes, infixes) from a word in order to obtain a word stem. Porter Stemmer is the most widely used technique because it is very fast. Generally, stemming chops off end of the word, and mostly it works fine.
- Example: Working -> Work
-
Lemmatization: The goal is same as with stemming, but stemming a word sometimes loses the actual meaning of the word. Lemmatization usually refers to doing things properly using vocabulary and morphological analysis of words. It returns the base or dictionary form of a word, also known as the lemma .
- Example: Better -> Good.
def preprocess_tweet_text(tweet):
tweet.lower()
# Remove urls
tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
# Remove user @ references and '#' from tweet
tweet = re.sub(r'\@\w+|\#','', tweet)
# Remove punctuations
tweet = tweet.translate(str.maketrans('', '', string.punctuation))
# Remove stopwords
tweet_tokens = word_tokenize(tweet)
filtered_words = [w for w in tweet_tokens if not w in stop_words]
#ps = PorterStemmer()
#stemmed_words = [ps.stem(w) for w in filtered_words]
#lemmatizer = WordNetLemmatizer()
#lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in stemmed_words]
return " ".join(filtered_words)
Stemming is faster than lemmatization. You can uncomment the code and see how results change. Note: Do not apply both. Remember that stemming and lemmatization are normalization techniques, and it is recommended to use only one approach to normalize. Let your project requirements guide your decision, or you can always do experiments and see which one gives better results. In this case, stemming and lemmatizing yield almost the same accuracy.
- Vectorizing Data: Vectorizing is the process to convert tokens to numbers. It is an important step because the machine learning algorithm works with numbers and not text.
In this guide, you'll implement vectorization using tf-idf. There are other techniques as well, such as Bag of Words and N-grams.
def get_feature_vector(train_fit):
vector = TfidfVectorizer(sublinear_tf=True)
vector.fit(train_fit)
return vector
Important Note: I am using the dataset as the corpus to make a tf-idf vector. The same vector structure should be used for training and testing purposes.
The target column is comprised of the integer values 0, 2, and 4. But users do not usually want their results in this form. To convert the integer results to be easily understood by users, you can implement a small script.
def int_to_string(sentiment):
if sentiment == 0:
return "Negative"
elif sentiment == 2:
return "Neutral"
else:
return "Positive"```
Bringing Everything Together
In this section, we will call all the functions that you have created. You'll see Naive Bayes and Logistic Regression algorithms for predictions. These two algorithms are quite popular in NLP, although you can try out other options too.
# Load dataset
dataset = load_dataset("data/training.csv", ['target', 't_id', 'created_at', 'query', 'user', 'text'])
# Remove unwanted columns from dataset
n_dataset = remove_unwanted_cols(dataset, ['t_id', 'created_at', 'query', 'user'])
#Preprocess data
dataset.text = dataset['text'].apply(preprocess_tweet_text)
# Split dataset into Train, Test
# Same tf vector will be used for Testing sentiments on unseen trending data
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))
# Training Logistics Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))
Naive Bayes is giving nearly 76% accuracy, and Logistic Regression gives nearly 79%. These accuracy figures are recorded without implementing stemming or lemmatization. Using better techniques, you might get better accuracy.
Testing on Real-time Feeds
This step is completely optional and will only apply if you have read and implemented the guide Building a Twitter Bot with Python.
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])
# Creating text feature
test_ds.text = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())
# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)
# Averaging out the hashtags result
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction':test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['heashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)
print(test_result)
Replace the file name with your own in the test_file_name variable.
Conclusion
I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems depending on you how you want to use it.
I highly recommended using different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try to implement more machine learning models and you might be able to get accuracy over 85%.
If you have any questions, feel free to reach out to me at CodeAlphabet.