Sentiment Analysis using NLP in TensorFlow
How do we build a classifier that recognizes sentiment in text?
To understand this, we will look at an example dataset of news headlines from Kaggle and classify each headline as Sarcastic or Not Sarcastic. We will train a classifier on this dataset, and it will determine our output!
The original dataset from Kaggle is in JSON format, with the URL of the article and its headline, followed by an is_sarcastic label:
- 1 if it is SARCASTIC
- 0 if it is NOT SARCASTIC
The first step is to convert this JSON into a Python LIST, with all the records encapsulated within a single pair of square brackets, so it looks like this:
This can be achieved in Python with the json library: load the JSON file, then iterate over it, appending each record's fields to the lists created to store the dataset for classification.
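As a minimal sketch of that step (the two inline records are hypothetical stand-ins for the 26,709 real ones, which would be read from the converted file with json.load):

```python
import json

# In practice you would read the converted Kaggle file with
# datastore = json.load(open("sarcasm.json")); a tiny inline
# sample keeps this snippet self-contained.
raw = """[
  {"article_link": "https://example.com/a",
   "headline": "area man shocked to learn coffee contains caffeine",
   "is_sarcastic": 1},
  {"article_link": "https://example.com/b",
   "headline": "city council approves new transit budget",
   "is_sarcastic": 0}
]"""

datastore = json.loads(raw)

sentences, labels, urls = [], [], []
for item in datastore:
    sentences.append(item["headline"])   # text to classify
    labels.append(item["is_sarcastic"])  # 1 = sarcastic, 0 = not
    urls.append(item["article_link"])    # kept for reference only

print(len(sentences), labels)  # 2 [1, 0]
```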
Now we will start pre-processing the data, beginning with TOKENIZATION, applied to the news headlines alone rather than the entire corpus!
Then we will convert our sentences into sequences of tokens and pad them all to the same length, which will look like this (26,709 sequences with 40 tokens each):
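A minimal sketch of tokenizing and padding with the classic tf.keras preprocessing API (the &lt;OOV&gt; token and "post" padding are assumed settings; maxlen=40 matches the padded length above, and three toy headlines stand in for the real dataset):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy headlines standing in for the 26,709 real ones.
sentences = [
    "scientists discover water is wet",
    "local man wins lottery twice",
    "government announces new policy",
]

tokenizer = Tokenizer(oov_token="<OOV>")  # marker for unknown words
tokenizer.fit_on_texts(sentences)         # builds the word index
sequences = tokenizer.texts_to_sequences(sentences)

# Pad every sequence to the same length of 40 tokens.
padded = pad_sequences(sequences, maxlen=40, padding="post")
print(padded.shape)  # (3, 40); (26709, 40) on the full dataset
```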
Next, we will split our dataset into two parts:
- Training dataset
- Testing dataset
To truly test the efficiency of our neural network, we will fit our tokenizer on the training dataset only, and then use it to sequence and pad both sets.
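A sketch of that split, again on a toy corpus (the split point and all settings here are assumptions; on the real dataset a cutoff such as the first 20,000 headlines for training is a common choice):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy corpus and labels standing in for the real dataset.
sentences = [
    "scientists discover water is wet",
    "local man wins lottery twice",
    "government announces new policy",
    "dog elected mayor of small town",
]
labels = [1, 0, 0, 1]

training_size = 3
training_sentences = sentences[:training_size]
testing_sentences = sentences[training_size:]

# Fit the tokenizer on the TRAINING sentences only, so test-set
# words it has never seen map to the <OOV> token.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)

max_length = 40
training_padded = pad_sequences(
    tokenizer.texts_to_sequences(training_sentences),
    maxlen=max_length, padding="post")
testing_padded = pad_sequences(
    tokenizer.texts_to_sequences(testing_sentences),
    maxlen=max_length, padding="post")

print(training_padded.shape, testing_padded.shape)  # (3, 40) (1, 40)
```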
So, now how do we classify our data as SARCASTIC or not?
Consider the most basic sentiments, GOOD and BAD. We often see them as opposites, so let's plot them on the positive and negative X-axis.
Now, by plotting more words against this baseline and giving them coordinates, we can start determining the sentiment behind each word!
So, what if we take Sarcastic and Not Sarcastic as the parameters and extend this simple scenario into multiple dimensions?
As we load more and more data into the model, these directions can change.
When we give the fully trained model a test sentence, it looks up the learned vector for each word, sums them up, and gives us a classification; the mapping of words to these vectors is called EMBEDDING.
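The idea can be sketched with toy numbers (the 2-D word "directions" below are made up purely for illustration; a real model learns them during training):

```python
import numpy as np

# Hypothetical 2-D direction vectors learnt for each word:
# a positive first component leans sarcastic, negative leans not.
word_vectors = {
    "totally": np.array([0.9, 0.1]),
    "great":   np.array([-0.4, 0.8]),
    "idea":    np.array([0.2, -0.3]),
}

sentence = ["totally", "great", "idea"]

# Combine the word vectors into one sentence vector; the
# resulting direction is what the classifier then scores.
sentence_vector = np.mean([word_vectors[w] for w in sentence], axis=0)
print(sentence_vector)
```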
Now, lets take a look at coding this Neural Network!
- The first layer is an Embedding, where the direction of each word is learnt epoch by epoch.
- Next we pool with Global Average Pooling, i.e. averaging the vectors of all the words in a sentence.
This is then fed into a Deep Neural Network!
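A sketch of this architecture in tf.keras (the vocabulary cap, 16-dimensional embedding, and 24-unit dense layer are assumed hyperparameters, not taken from the text):

```python
import tensorflow as tf

vocab_size = 10000   # assumed vocabulary cap
embedding_dim = 16   # assumed embedding width
max_length = 40      # matches the padded length used earlier

model = tf.keras.Sequential([
    # Embedding: learns a direction vector per word, epoch by epoch.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Global Average Pooling: averages the word vectors of a sentence.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Dense layers: the classifier itself.
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(sarcastic)
])
model.compile(loss="binary_crossentropy",
              optimizer="adam", metrics=["accuracy"])

# Training would then be, for the ~30 epochs the article mentions:
# model.fit(training_padded, training_labels, epochs=30,
#           validation_data=(testing_padded, testing_labels))
```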
After training it for around 30 epochs, the model was able to predict with 99 percent accuracy on the training data, whereas on the testing data (headlines the model had never seen before) it still achieved 82–84 percent accuracy!
Now, let's try to predict the output for some new data:
Give it a try, and see for yourself whether these two inputs are sarcastic or not!
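The mechanics of that prediction step can be sketched end to end (the sentences, toy tokenizer, and untrained toy model here are all stand-ins, so the probabilities are meaningless; only a fully trained model gives real sarcasm scores):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy tokenizer and untrained toy model stand in for the real ones.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(["some training headlines would go here"])

max_length = 40
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

new_sentences = [
    "granny starting to fear spiders in the garden might be real",
    "weather forecast predicts rain for the weekend",
]

# Same pipeline as training: words -> token sequences -> padded.
padded = pad_sequences(tokenizer.texts_to_sequences(new_sentences),
                       maxlen=max_length, padding="post")

# Each output is a probability: close to 1 -> likely sarcastic,
# close to 0 -> likely not.
predictions = model.predict(padded)
print(predictions.shape)  # (2, 1)
```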
The output will look something like this, indicating that the first sentence has a very high chance of being SARCASTIC, whereas the second one has a very low chance!