The growth of the internet through social networks such as Facebook, Twitter, LinkedIn, Instagram, and others has led to a huge amount of user interaction and has empowered users to express themselves freely. But with the constant bombardment of social media, we are starting to lose touch with our close friends through basic forms of communication such as texting and meeting up. To help with this predicament, we decided to create a model that carries out sentiment analysis on text messages and, combined with time analysis, suggests actions to strengthen the relationship between two users.
The traditional approach of manually extracting complex features, identifying which features are relevant, and deriving patterns from this huge amount of information is very time-consuming and requires significant human effort.
However, Machine Learning can perform sentiment analysis on this conversational data with excellent results via Natural Language Processing (NLP) techniques. The core idea of these techniques is to identify complex features in vast amounts of data using learning models; the algorithms learn new complex features automatically. Here the goal is to classify the opinions and sentiments expressed by users.
Choosing a dataset
When it came to looking for a viable dataset, we searched extensively for messages that resembled chat messages. We came across very few, owing to privacy policies around personal messaging. We hit the jackpot when we found four candidate datasets: the Enron email dataset, the IMDb review dataset, a Twitter dataset, and the freeCodeCamp chatroom dataset. Based on our needs, we eliminated three of them: the Enron dataset (the conversations were formal and in an email format, and formal conversations are rare in day-to-day casual chatting), the IMDb dataset (the texts were simply reviews of movies), and the Twitter dataset (although it consisted of casual language and slang, it didn't match the format of a chat between two people).
This finally led us to our golden pot: the freeCodeCamp chatroom dataset. The conversations in the chatroom ticked all the boxes of text messaging: informal casual language, emojis, slang words, and short texts with emotion!
Creation of a DataFrame
The next step was to load the dataset into the Python notebook and display part of the data to see what it looks like.
Woah! 5 million samples! As computation time is an important factor in building a model, we decided to reduce it to around 100 thousand samples.
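As a rough sketch of this step (the file name and column layout are assumptions, since the post doesn't show the code), loading and downsampling with pandas might look like this:

```python
import pandas as pd

# Hypothetical file name for the freeCodeCamp chatroom export
df = pd.read_csv("freecodecamp_casual_chatroom.csv")
print(df.shape)    # ~5 million rows in the full dump
print(df.head())   # peek at the first few messages

# Downsample to ~100k rows to keep computation time manageable
df = df.sample(n=100_000, random_state=42).reset_index(drop=True)
```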
Cleaning, Pickling, and Tokenization
No dataset can be used in its raw form; it requires some cleaning to remove all the unnecessary details. Training classifiers and machine learning algorithms can take a very long time, especially against a larger dataset. Can you imagine having to train the classifier every time you wanted to fire it up and use it? What horror!
Instead, what we can do is use the Pickle module to go ahead and serialize our classifier object, so that all we need to do is load that file in real quick. And that’s what we did!
Pickle is a module in Python used for serializing and de-serializing Python objects. It converts Python objects like lists, dictionaries, etc. into byte streams (zeroes and ones). You can convert the byte streams back into Python objects through a process called unpickling.
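A minimal sketch of that workflow (the object and file names here are ours, not from the original code):

```python
import pickle

data = {"messages": ["hey!", "how are you?"], "labels": ["pos", "pos"]}

# Serialize (pickle) the object into a byte-stream file
with open("dataset.pickle", "wb") as f:
    pickle.dump(data, f)

# Later: deserialize (unpickle) it back into a Python object
with open("dataset.pickle", "rb") as f:
    restored = pickle.load(f)

assert restored == data
```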
After pickling the dataset, we proceeded to tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. We used the word_tokenize() method to split each sentence into words.
The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications. It can also serve as input for further text-cleaning steps such as punctuation removal, numeric character removal, or stemming.
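For example, with NLTK (the sample sentence is ours):

```python
import nltk
nltk.download("punkt")  # tokenizer models, only needed once
from nltk.tokenize import word_tokenize

sentence = "We found the golden pot!"
print(word_tokenize(sentence))
# ['We', 'found', 'the', 'golden', 'pot', '!']
```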
Sentiment Classifier
Here comes the main part of the project. After all the cleaning and pickling, we proceeded to create the model required to calculate the sentiment of the text in the dataset. The two methods we explored were machine learning algorithms and a lexicon-based classifier.
Lexicon-based sentiment analysis: applying a lexicon is one of the two main approaches to sentiment analysis. It involves calculating the sentiment from the semantic orientation of the words or phrases that occur in a text. This approach requires a dictionary of positive and negative words, with a sentiment value assigned to each word.
We used VADER for this; we discuss it further in the next section.
VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model used for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.
VADER sentiment analysis relies on a dictionary that maps lexical features to emotion intensities known as sentiment scores. The sentiment score of a text can be obtained by summing up the intensity of each word in the text.
We imported the necessary resources from NLTK: the VADER lexicon, and the analyzer included in the NLTK package, called SentimentIntensityAnalyzer.
VADER’s SentimentIntensityAnalyzer() takes in a string and returns a dictionary of scores in each of four categories:
negative
neutral
positive
compound (computed by normalizing the scores above)
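A quick sketch of what that looks like in practice (the example sentence and scores are illustrative):

```python
import nltk
nltk.download("vader_lexicon")  # only needed once
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This chatroom is really helpful!"))
# e.g. {'neg': 0.0, 'neu': 0.51, 'pos': 0.49, 'compound': 0.52}
```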
Let's talk about the math behind it. VADER is sensitive to both the polarity (whether the sentiment is positive or negative) and the intensity (how positive or negative the sentiment is) of emotions.
VADER incorporates this by assigning a valence score to each word under consideration. The valence score is measured on a scale from -4 to +4, where -4 stands for the most negative sentiment and +4 for the most positive sentiment. Intuitively one can guess that the midpoint 0 represents neutral sentiment, and that is indeed how it is defined.
The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence.
Below is the basic normalization formula behind VADER:

compound = x / √(x² + α)

where x = sum of the valence scores of the constituent words, and α = normalization constant (default value 15).
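This matches the normalization function in the VADER source; as a small Python sketch:

```python
import math

def normalize(x, alpha=15):
    # Maps the unbounded sum of valence scores x into (-1, +1)
    return x / math.sqrt(x * x + alpha)

print(normalize(4.0))   # strongly positive sum -> ~0.72
```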
Addition of scores to the DataFrame
After the required cleaning of the dataset, we proceeded to add columns to the original DataFrame to store the polarity_scores dictionaries, the extracted compound scores, and new "pos"/"neg" labels derived from the compound score.
The last column acted as the label required for an accuracy test. In this method, each text was scored with negative, positive, and neutral ratios.
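A sketch of how those columns can be added, reusing the sia analyzer from the snippet above (the column names follow the post's descriptions but are otherwise assumptions):

```python
df["scores"] = df["text"].apply(sia.polarity_scores)
df["compound"] = df["scores"].apply(lambda s: s["compound"])
df["comp_score"] = df["compound"].apply(lambda c: "pos" if c >= 0 else "neg")
```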
As you can see, VADER is able to calculate the positive, neutral, and negative ratios as well as the compound score of a statement. The compound score acts as the decider that states whether a statement is positive or negative, as shown in the images above.
VADER Accuracy
In order to understand whether a model is working properly, we need to check its accuracy. We decided to clean the text further to focus more on the words rather than the punctuation, and added two columns to our DataFrame called "cleaned_text" and "cleaned_text_final" (you will find out why we did this further in the post!).
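The exact cleaning rules aren't shown in the post, but a helper along these lines would do the job:

```python
import re

def clean_text(text):
    text = str(text).lower()                  # normalize case
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

df["cleaned_text"] = df["text"].apply(clean_text)
```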
Training and Testing
So you might be thinking: why did we clean the text twice? In order to understand whether our VADER sentiment analyzer worked, we needed to know its accuracy, i.e. whether it displays the correct sentiment according to the labeled data.
Hence we divided the dataset into an independent variable (cleaned_text) and a dependent variable (comp_score), which we used to create a split (training and testing datasets) with a chosen test_size, and we proceeded to randomize it.
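With scikit-learn, that split looks roughly like this (the test_size of 0.3 is an assumption):

```python
from sklearn.model_selection import train_test_split

X = df["cleaned_text"]   # independent variable
y = df["comp_score"]     # dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)
```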
Now that we had our training and testing sets, we proceeded to analyze the confusion matrix to understand the true positives, true negatives, false positives, and false negatives!
The job is not done yet! We still don't know the accuracy. The next step was to calculate not only the accuracy, but also the precision and recall of the VADER model.
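With scikit-learn's metrics, this step might look like the following, where y_pred stands for VADER's pos/neg labels on the test set (a name we introduce here):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

print(confusion_matrix(y_test, y_pred))       # TP / FP / FN / TN counts
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # per-class precision & recall
```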
94% accuracy! For a second we believed that this was it, this was the final model, but that didn't work out so well when we tested it on a more complicated statement. This is what happened:
Here VADER took the word "dreadful" into account more than "really GOOD", and therefore labelled the sentence as negative, when in fact it was positive.
VADER acts like a double-edged sword. The accuracy is high because it utilizes a pre-existing dictionary, yet this very point is also a disadvantage: the model cannot be moulded to give solutions to specific needs or problem statements.
This is where machine learning models come into play.
Naive Bayes Classifier
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The best part about this classifier was that we could tweak its parameters to fit our needs!
It is based on the application of Bayes' rule, given by the following formula:

P(c | d) = P(d | c) · P(c) / P(d)

In our case, a text d is represented by a vector of k attributes, d = (w_1, w_2, …, w_k). Computing P(d | c) directly is not trivial, which is why Naive Bayes introduces the assumption that all of the feature values w_j are independent given the category label c. That is, for i ≠ j, w_i and w_j are conditionally independent given c. So Bayes' rule can be rewritten as:

P(c | d) ∝ P(c) · ∏_{j=1}^{k} P(w_j | c)
There are several variants of Naive Bayes classifiers: the Multivariate Bernoulli model, the Multinomial model, and the Gaussian model. For this project, the most suitable was the Multinomial model, as we needed to classify sentiment as positive, neutral, or negative.
Training and Testing the new Naive Bayes Classifier Model
We created the model and tweaked its parameters so that it could learn from our dataset and give more accurate sentiment ratios and, finally, sentiments. A sketch of a typical setup is shown below.
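Our exact parameters aren't reproduced here, but a typical scikit-learn sketch of a multinomial Naive Bayes text classifier looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("tfidf", TfidfVectorizer()),      # turn text into weighted word features
    ("nb", MultinomialNB(alpha=1.0)),  # alpha = smoothing, a tunable parameter
])

model.fit(X_train, y_train)
y_pred_nb = model.predict(X_test)
```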
We used the sampled dataset for Naive Bayes to reduce the computation time from one hour to 18 minutes!
As you can see, this model not only told us about positive and negative texts but also included neutral texts — another advantage of using Naive Bayes over VADER.
Naive Bayes Classifier Accuracy
After building this improved classifier model, we couldn't wait to calculate its accuracy. The procedure was simple: we used the training and testing sets and plotted a confusion matrix, and the results were as follows:
What a beautiful confusion matrix! The next step was to calculate the accuracy, precision, and recall for the same.
89% accuracy! Although the accuracy was lower than expected, the threshold values used for the calculation of the sentiments were the ones that gave the highest accuracy. Hence we were satisfied with our results.
Testing out the new model with new data
Ending on a good note, we decided to test a random statement of our own and see if our model worked the way it should. We used the statement:
i think the good example in the link will clarify
Output : [i think the good example in the link will clarify] Positive
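With the pipeline sketched earlier, that check is a one-liner (model being our hypothetical pipeline object):

```python
statement = "i think the good example in the link will clarify"
print(f"[{statement}]", model.predict([statement])[0])
```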
Hurray! We found our perfect model!
Conclusion
Looking at both accuracies, we decided to choose the Naive Bayes classifier over VADER for the following two reasons:
· The total computation time is low.
· Being a machine-learning-based model, with the required tweaking of parameters it can be made more specific to the required solution, in this case an effective backend for a virtual assistant.
Nowadays, sentiment analysis, or opinion mining, is a hot topic in machine learning. We are still far from detecting the sentiments of a corpus of texts very accurately because of the complexity of the English language.
In this project we tried to show a basic way of classifying Slack-style chat data into positive or negative categories using Naive Bayes, and how language models relate to Naive Bayes and can produce better results. We could further improve our classifier by extracting more features from the messages, trying different kinds of features, tuning the parameters of the Naive Bayes classifier, or trying another classifier altogether.
The future of sentiment analysis will continue to dig deeper, far past the surface of likes, comments, and shares, and aim to reach, and truly understand, the significance of social media interactions and what they tell us about the users behind the screens. Virtual assistants with sentiment analysis will be able to make more accurate suggestions to users in order to make their relationships with other users stronger. Social media, which today can make people lonelier, will in the future help you reconnect with friends and family; the forgotten roots of social media will be rediscovered.
Thank you!
This project holds a special place in my heart, as it was our final-year capstone project at MIT World Peace University, Pune.
This project was not just a one-man job. It took the best of three people: me, Nivedita Shanbhag, and my wonderful teammates Ishwari Modak and Jahnvi Warbhe. It would have been impossible to achieve these results without the help and support of my teammates. All three of us worked in unison on the code as well as the report, and presented it to our university.
I am highly indebted to Mr. Sabareesh Chinta for his guidance and constant supervision, for providing necessary information regarding the project, and for his support in completing it. I would like to express my gratitude to the Head of the School of Electronics and Communication, Dr. Arti A. Khaparde, for her kind co-operation and encouragement, which helped me in the completion of this project, and to Dr. Rupali Kute for timely inputs and weekly reviews of my project work.
I would like to express my special gratitude and thanks to MIT-WPU for giving me such great opportunities to better my skills.
My thanks and appreciation also go to my colleagues in developing the project, and to the people who willingly helped me out with their abilities.
It was the loveliest experience, and I bet all three of us are looking forward to a wonderful future with the knowledge we gained from this project.