Natural Language Processing using TensorFlow
The first step in understanding this technology is representing words in a way a computer can process, and then training a Neural Network that can understand their meaning. This process of converting text into numbers is called TOKENIZATION.
So, the first question that arises is: what are Neural Networks?
Neural Networks are at the heart of DEEP LEARNING algorithms. Their name and structure are inspired by the human brain: they mimic the way biological neurons signal to one another.
They allow computer programs to recognize patterns and solve common problems in the fields of AI, machine learning, and deep learning.
They consist of node layers, which are of 3 types-
- Input Layer
- Hidden Layers (number depends on the application)
- Output Layer
Each node in a layer is connected to the nodes in the next layer and has an ASSOCIATED WEIGHT and a THRESHOLD; only when the weighted sum of its inputs exceeds the threshold is the signal passed on to the next layer, otherwise not!
They rely on training data to learn, improving their accuracy over time.
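The weight-and-threshold behaviour described above can be sketched as a tiny function. This is a minimal illustration only; the function name and numbers are made up for this example, not taken from any library:

```python
def neuron_output(inputs, weights, threshold):
    # Weighted sum of the inputs coming into this node
    total = sum(x * w for x, w in zip(inputs, weights))
    # The node "fires" (passes its value on) only above the threshold
    return total if total > threshold else 0

print(neuron_output([1, 2], [0.5, 0.25], 0.8))  # 1.0 -> above threshold, passed on
print(neuron_output([1, 2], [0.5, 0.25], 1.5))  # 0 -> below threshold, not passed on
```

Training a network means adjusting those weights so the outputs become more accurate over time.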
One of the most interesting examples of an application of neural networks is Google's speech recognition algorithm.
Coming to TOKENIZATION !
Consider a word: it is made up of letters, which in turn can be encoded into numbers for the computer to understand; a popular example of such an encoding is ASCII.
Now, assigning numbers to each letter makes it harder for us to give a sentiment to a word, because 2 different words with the same letters in a shuffled order will have completely different meanings. So, it is easier to encode each word instead of each letter.
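A quick sketch shows the problem with letter-level encoding; the word pair here is my own illustrative example:

```python
# "listen" and "silent" contain exactly the same letters, shuffled,
# so their ASCII letter codes carry no hint of their different meanings.
print([ord(c) for c in "listen"])  # [108, 105, 115, 116, 101, 110]
print(sorted("listen") == sorted("silent"))  # True
```

Encoding whole words sidesteps this: "listen" and "silent" each get their own distinct token.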
Now, let us look at how we can use the Tokenizer API from TensorFlow Keras in Python to achieve this!
Here we go through all the sentences, assigning each unique word a token, and we can see the assigned tokens using the word_index property of the tokenizer. The output for this will look something like this.
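A minimal sketch of this step; the two example sentences are my own choice for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I love my dog",
    "I love my cat",
]

# num_words keeps only the most frequent words (100 is plenty here)
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# Each unique word (lowercased, punctuation stripped) gets a token
print(tokenizer.word_index)
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```

Note that the tokenizer lowercases the text and strips punctuation before assigning tokens.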
The next step is to represent each sentence as its words' tokens in the correct order, which is done by SEQUENCING; after this, our data will be ready for processing by a neural network, to understand and create new text.
So, coming to SEQUENCING!
The tokenizer provides us with a method, texts_to_sequences, which creates a sequence of tokens representing each sentence.
The output for this will look like
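Continuing the sketch from above (same illustrative sentences, my own choice):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I love my dog",
    "I love my cat",
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
# word_index is {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

# Each sentence becomes the list of its words' tokens, in order
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```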
So, now what happens when our tokenizer sees words that it has never seen before?
It will simply ignore the words it has not seen before, and hence the encoded sentence loses its original length. To avoid that, we can use the oov_token (out of vocabulary) argument of the Tokenizer, which replaces every unrecognized word with a specified token; using this, we will at least not lose the original length.
Next to train our Neural Network to deal with sentences of unequal lengths, we will use something as simple as PADDING.
We will use the pad_sequences function of TensorFlow Keras on the sequences.
The output for which will look like this-
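A sketch of padding in action; the three sentences are my own illustrative choices, with the third deliberately longer than the others:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    "I love my dog",
    "I love my cat",
    "Do you think my dog is amazing",
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# By default, shorter sequences are padded with zeros at the FRONT ("pre")
# up to the length of the longest sequence in the corpus
padded = pad_sequences(sequences)
print(padded.tolist())
# [[0, 0, 0, 2, 3, 1, 4], [0, 0, 0, 2, 3, 1, 5], [6, 7, 8, 1, 4, 9, 10]]

# padding="post" puts the zeros at the END, and maxlen caps the length
padded_post = pad_sequences(sequences, padding="post", maxlen=5)
```

Every row now has the same length, so the data can be fed into a neural network as a single tensor.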
The padded sentences are all of the same length, i.e., the length of the longest sentence in the corpus, or we can set our own desired maximum length with maxlen. The padding itself can be done in 2 ways-
- Padding Front (padding="pre")
- Padding Back (padding="post")
After this pre-processing, our training data is ready to be processed by a Neural Network. Some of the applications it can be used for are-
- Classifying text into sarcastic and non-sarcastic
- Detecting cyberbullying
- Classifying movies as good or bad using user reviews
In the next blog, we will be looking at a model to classify text as Sarcastic or Non-Sarcastic.