Visualize NLP: The intuition behind RNN, LSTM, GRU and the Attention Mechanism

Rahul Karanam

Being able to predict the next event is something all of us wish we could do.

It has major benefits in fields such as stock markets, movies, artists, gambling, language and the arts, language modelling, understanding genome sequences, and medical signals.

I wish I had this ability in my brain: a function I could activate to predict the next move or event based on past data.

Maybe in the future we will find this kind of tech in our brains, using Artificial Intelligence and Deep Learning to create Shakespearean poems, speeches, stories, movies and so on.

Sounds Sci-fi !!!

Now let us come back to the topic.

We are talking about RNNs, the foretellers among neural networks.

RNN (Recurrent Neural Network)

Let us take an example to explain RNNs. Example sentence: “Sun rises in the east”.

An RNN takes each word as an input and passes the previous hidden state along as an additional input to the next step. Given “Sun rises in the”, it can predict the output “east” (by giving a higher weight to the word “east”).
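
To make this recurrence concrete, here is a minimal sketch of a vanilla RNN step in plain numpy. The sizes and random weights are my own toy choices, not a trained model: the point is only that the same weight matrices are reused at every time step and the hidden state carries the context forward.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# The same weights are shared across all time steps
W_xh = rng.normal(size=(hidden_size, embed_size))   # input  -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One RNN step: mix the current word vector with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# "Sun rises in the" as four toy word vectors
sentence = [rng.normal(size=embed_size) for _ in range(4)]
h = np.zeros(hidden_size)
for x_t in sentence:
    h = rnn_step(x_t, h)   # the hidden state carries the context forward

# In a trained model, a softmax layer over the vocabulary on top of `h`
# would give "east" the highest score; here the weights are random,
# so this only shows the mechanics of the recurrence.
```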

RNN-Vanishing Gradient

Vanishing gradients and short memory are the problems faced by the traditional or vanilla RNN. By the time the network has seen all the inputs, it may already have forgotten the earliest ones, because their contribution to the current state has shrunk close to 0.

An RNN is a network that feeds into itself
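
The effect is easy to see numerically: backpropagating through many time steps multiplies the gradient by roughly the same recurrent factor over and over, so any factor below 1 shrinks the early signal towards zero. A toy illustration (the 0.9 below is an arbitrary stand-in for the repeated recurrent term, not a real gradient):

```python
# Toy illustration of the vanishing gradient: the gradient that reaches
# early time steps is repeatedly multiplied by (roughly) the same
# recurrent factor; anything below 1 decays towards zero.
grad = 1.0
recurrent_factor = 0.9   # arbitrary stand-in for the repeated W_hh * tanh'() term

for step in range(50):
    grad *= recurrent_factor

print(grad)   # ~0.005: almost nothing of the early inputs' signal survives 50 steps
```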

To hold information over longer spans, the LSTM is used. It is similar to a traditional RNN but with some extra toppings: gates that control the memory of its previous states.

LSTM-Long Short Term Memory

Think of an RNN as Dory from the movie Finding Nemo, who has very little memory of the previous states.

An LSTM can be thought of as Nemo, who handles long-term dependencies (the memory of the context) much better.

An LSTM has primarily four gates: the forget gate, the input gate, the update gate and the output gate.

It has basically three inputs: the long-term memory state (cell state Cₜ₋₁), the short-term memory state (hidden state Hₜ₋₁) and the current input (Xₜ).

The process flow can be described in two steps.

Step 1: when the three inputs enter the LSTM, they first go into the forget gate and the input gate.

Part of the long-term memory (cell state) is forgotten using the forget gate.

The short-term state (hidden state) and the input X go into the input gate. This gate decides what new information will be learned.

Step 2: the data that flows through the forget gate (that is, what was not forgotten; the forgotten information stays at the gate) and the new information that passes through the input gate go to the update gate (which makes up the new long-term memory) and then to the output gate (which updates the short-term memory, the hidden state, and is the outcome of the network).

Cₜ - the cell state (context memory), holding the accumulated context of the inputs.

Hₜ - the hidden state, carrying the previous step's information.

Xₜ - the input to the LSTM at the current step.

Forget Gate

This gate decides what information to forget from the cell state. It is a sigmoid layer whose output is multiplied element-wise with the previous cell state (Cₜ₋₁). The sigmoid layer takes two inputs: the previous hidden state (Hₜ₋₁) and the current input (Xₜ). Basically, the long-term memory (LTM) gets multiplied by the forget gate's output (fₜ), and this factor makes some of the long-term information be “forgotten”.

Scenario: think of the forget gate as a regulator. After every exam you try to forget the previous subject's exam details (Hₜ₋₁) from your memory (the long-term memory, Cₜ), as they will be useless for the next exam.
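
In equation form the forget gate is fₜ = sigmoid(Wf·[Hₜ₋₁, Xₜ] + bf), and it scales the old cell state element-wise. A minimal numpy sketch, with toy sizes and random weights chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden_size, input_size = 6, 4

# Forget gate parameters (random here, learned in a real network)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # H(t-1): previous hidden state
x_t    = rng.normal(size=input_size)    # X(t):   current input
c_prev = rng.normal(size=hidden_size)   # C(t-1): previous cell state

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)   # values in (0, 1)
c_after_forget = f_t * c_prev   # entries near 0 are "forgotten", near 1 are kept
```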

Input gate

This gate is used for updating the cell state (Cₜ₋₁) with new information. We use the tanh activation due to its squashing capabilities (it keeps values between -1 and 1).

The gate combines the existing short-term memory (Hₜ₋₁) and the input Xₜ, multiplies them by a weight matrix (W), adds a bias, and then squishes all of this through a tanh activation function. Call this candidate Nₜ.

Then it ignores some of this candidate information by multiplying the combined result by an ignore function iₜ.

The ignore function is calculated by combining Hₜ₋₁ and Xₜ with a new set of weights (W) and biases (b), passed through a sigmoid.

Once we have Nₜ and iₜ, we multiply them together, and that is the result of the input gate.

inp_gate=Nₜ*iₜ

This is the new information X that we will add to the cell state.

Scenario: think of preparing for the SAT/GMAT. In the reading-comprehension section you first read (skim) the whole passage, take in the information that is relevant to the topic, and add this new information to your memory in order to answer the questions that follow.
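
As equations, the candidate is Nₜ = tanh(Wn·[Hₜ₋₁, Xₜ] + bn), the ignore gate is iₜ = sigmoid(Wi·[Hₜ₋₁, Xₜ] + bi), and inp_gate = Nₜ * iₜ. A small numpy sketch under the same toy-size assumptions as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_size, input_size = 6, 4

# Candidate ("learn") and ignore-gate parameters
W_n, b_n = rng.normal(size=(hidden_size, hidden_size + input_size)), np.zeros(hidden_size)
W_i, b_i = rng.normal(size=(hidden_size, hidden_size + input_size)), np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # H(t-1)
x_t    = rng.normal(size=input_size)    # X(t)
hx = np.concatenate([h_prev, x_t])

n_t = np.tanh(W_n @ hx + b_n)    # candidate new information, squashed into (-1, 1)
i_t = sigmoid(W_i @ hx + b_i)    # ignore gate: how much of each candidate value to keep
inp_gate = n_t * i_t             # the new information to be added to the cell state
```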

Update gate

This is the process of updating the cell state: it drops the things we decided to forget earlier and adds the new values, scaled by how much we decided to update each state value.

In this step, the long-term memory (Cₜ₋₁) is updated to Cₜ. The new cell state is the sum of the old cell state scaled by the forget gate and the newly added information from the input gate.

Cₜ = fₜ*Cₜ₋₁ + inp_gate
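
A tiny numeric example of that update, with made-up numbers standing in for the gate outputs:

```python
import numpy as np

# Stand-in numbers for the quantities produced by the forget and input gates
c_prev   = np.array([ 0.5, -1.2,  0.3])   # old cell state C(t-1)
f_t      = np.array([ 0.9,  0.1,  0.5])   # forget gate output
inp_gate = np.array([ 0.2,  0.7, -0.4])   # input gate output (n_t * i_t)

c_t = f_t * c_prev + inp_gate   # forget part of the old memory, then add the new
print(c_t)                      # ~[0.65, 0.58, -0.25]
```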

Output Gate

With the new long-term memory in hand, this gate filters it to produce the new hidden state, sends it out as the output, and also sends it as an input to the next LSTM step. It squashes the new cell state Cₜ with a tanh activation and multiplies the result by sigmoid(input (Xₜ), Hₜ₋₁), the output gate.
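
In equation form: oₜ = sigmoid(Wo·[Hₜ₋₁, Xₜ] + bo) and Hₜ = oₜ * tanh(Cₜ). A short numpy sketch, again with toy sizes and random values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
hidden_size, input_size = 6, 4

# Output gate parameters
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # H(t-1): previous hidden state
x_t    = rng.normal(size=input_size)    # X(t):   current input
c_t    = rng.normal(size=hidden_size)   # C(t):   new cell state from the update step

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)   # the output gate
h_t = o_t * np.tanh(c_t)   # new hidden state: the output, also fed to the next step
```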

GRU - Gated Recurrent Unit

This is a streamlined version of the LSTM, with small changes in how the gates and state work inside the network. It merges the cell state and the hidden state into a single state, uses an update gate and a reset gate, and has no separate output gate.
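
A minimal GRU step might look like the sketch below (toy numpy code with random weights; note there is only one state vector h, controlled by an update gate z and a reset gate r, and the sign convention for z varies between references):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
hidden_size, input_size = 6, 4

W_z = rng.normal(size=(hidden_size, hidden_size + input_size))  # update gate
W_r = rng.normal(size=(hidden_size, hidden_size + input_size))  # reset gate
W_h = rng.normal(size=(hidden_size, hidden_size + input_size))  # candidate state

def gru_step(x_t, h_prev):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)      # how much of the old state to keep
    r_t = sigmoid(W_r @ hx)      # how much of the past to reset before mixing
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return z_t * h_prev + (1.0 - z_t) * h_cand   # one state vector, no separate cell state

h = gru_step(rng.normal(size=input_size), np.zeros(hidden_size))
```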

Rise and Fall of LSTM

Initially, LSTMs were the go-to model for many NLP tasks such as text generation and sentiment analysis. An LSTM can be used to extract deeper context from the data; this can be compared with the receptive field in an image network: the deeper the layers, the more of the input each feature can see. LSTMs are better at memorizing long sequences, but in terms of computation they are slow compared to current state-of-the-art models. Training RNNs/LSTMs on large datasets consumes a lot of time and resources, which has made them look a bit old-fashioned. Currently, the attention mechanism (the Transformer) is considered the state of the art.

Let us give attention to the attention mechanism

Attention is all you need !!!!! :-)

Need for attention mechanism

We don't read data strictly sequentially; in fact, we interpret characters, words and sentences as groups. Much like an attention-based mechanism, our mind perceives the whole sequence and builds a contextual representation of it. If we truly read character by character, sequentially, we would stop and notice every discrepancy; instead we often gloss right over them.

Example

Rahul is “driving” a blue “car”

In the above text, the word “driving” has a higher closeness to “car” (a vehicle): these two words have high attention to each other compared to the colour of the car.

Based upon this attention, the weights of the respective words are assigned.

In the old encoder-decoder architecture, i.e. the seq2seq model, we always translate the whole text only after the encoder has looked at all the inputs: the entire input is compressed into a single vector and sent to the decoder to translate.

This type of translation has two disadvantages:

  1. It cannot hold on to long text or context.
  2. Information is lost when all the inputs are compressed into one vector (the order and number of words can change the whole meaning).

Due to these issues, the attention mechanism came into the picture.
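
The core idea can be sketched as scaled dot-product attention: every word (query) scores every other word (key), a softmax turns the scores into weights, and the output is a weighted sum of values. A toy numpy version, with random vectors standing in for the embeddings of "Rahul is driving a blue car" (in a real transformer the Q, K, V projections are learned; here they are identity for brevity):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
d = 8                                    # embedding size
tokens = ["Rahul", "is", "driving", "a", "blue", "car"]
X = rng.normal(size=(len(tokens), d))    # toy word vectors

# In a real transformer, Q, K and V come from learned projections of X;
# to keep the sketch short we use X directly.
Q, K, V = X, X, X

scores  = Q @ K.T / np.sqrt(d)           # how strongly each word attends to every other word
weights = softmax(scores)                # each row sums to 1
context = weights @ V                    # a contextual representation of each word

print(weights[tokens.index("driving")])  # attention of "driving" over all six words
```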

