To better understand RNNs, let us look at an example of a machine translation model.
Here the model has two parts: an Encoder and a Decoder. We feed the tokenized input to the encoder one token at a time. In the example below, we first feed the word Comment to the encoder, which generates the corresponding hidden state #1. Next, we feed the encoder the second word, allez. Hidden state #2 is formed from hidden state #1 plus the tokenized word, and so on. Finally, hidden state #3, generated for the last word, is what is fed as context to the decoder.
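To make this concrete, here is a minimal NumPy sketch of that encoder loop. The embeddings, weight matrices, and sizes below are made up for illustration (a real model learns them); only the structure of the loop matters.

```python
import numpy as np

np.random.seed(0)
hidden_size, embed_size = 4, 3

# Toy embeddings for the French input "Comment allez vous" (illustrative values only)
tokens = ["Comment", "allez", "vous"]
embeddings = {tok: np.random.randn(embed_size) for tok in tokens}

# Randomly initialised RNN parameters (learned during training in a real model)
W_xh = np.random.randn(hidden_size, embed_size)
W_hh = np.random.randn(hidden_size, hidden_size)
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # hidden state #0
for tok in tokens:                           # feed the tokens one at a time
    x = embeddings[tok]
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # hidden state #1, #2, #3 ...

context = h   # in a plain seq2seq model, only this final hidden state reaches the decoder
print(context)
```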

The drawback of an RNN, or any plain sequence-to-sequence model, is that it is confined to sending a single context vector, no matter how long or short the input sequence is. Choosing the size of this vector is tricky: a small vector makes the model struggle with long input sequences, and while one might suggest using a larger hidden state, the model then tends to overfit on short sequences. This is the problem that Attention solves.
The first part of the encoder is similar to a sequence-to-sequence model, i.e. generating hidden states one word at a time. The difference comes when we create the context. Instead of passing just the final hidden state as context to the decoder, the encoder passes all of the hidden states to the decoder.
This gives us flexibility in the context size, i.e. longer sequences can have larger contexts. One point to note here is that every hidden state captures the essence of its corresponding word most strongly. For example, hidden state #1 will hold more information about the word Comment than about the other words.
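The only change to the encoder at this stage is that we keep every hidden state instead of just the last one. A short sketch, again with made-up values and sizes:

```python
import numpy as np

np.random.seed(0)
hidden_size, embed_size, seq_len = 4, 3, 3

X = np.random.randn(seq_len, embed_size)          # toy embeddings for "Comment allez vous"
W_xh = np.random.randn(hidden_size, embed_size)
W_hh = np.random.randn(hidden_size, hidden_size)

h = np.zeros(hidden_size)
encoder_states = []
for x in X:
    h = np.tanh(W_xh @ x + W_hh @ h)
    encoder_states.append(h)                      # keep every hidden state, not just the last

encoder_states = np.stack(encoder_states)         # shape (seq_len, hidden_size): the context grows with the input
print(encoder_states.shape)
```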

Let's take a closer look at our attention encoder, referring to the same example of translation from French to English.
- First, we pass the French sentence to an embedding look-up table, which stores the vectorized form of each word.
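A minimal sketch of such a look-up table; the vocabulary, indices, and vector values here are invented for illustration.

```python
import numpy as np

np.random.seed(0)

# Map each word of a toy vocabulary to a row index in the embedding matrix
word_to_index = {"comment": 0, "allez": 1, "vous": 2}
embed_size = 3
embedding_table = np.random.randn(len(word_to_index), embed_size)   # learned in a real model

sentence = ["comment", "allez", "vous"]
embedded = embedding_table[[word_to_index[w] for w in sentence]]     # shape (3, embed_size)
print(embedded)
```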

- After we have the embedding layer, feeding the first word into the first time-step of the RNN produces the first hidden state. This is known as the Unrolled View of the RNN, where we can see the RNN at each timestep.
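The "unrolled view" just means writing out one copy of the same RNN cell per timestep, with the same weights reused at every step. A small sketch with made-up weights:

```python
import numpy as np

np.random.seed(0)
hidden_size, embed_size = 4, 3
W_xh = np.random.randn(hidden_size, embed_size)
W_hh = np.random.randn(hidden_size, hidden_size)

def rnn_cell(x, h_prev):
    """One timestep of a simple RNN; the same weights are shared across timesteps."""
    return np.tanh(W_xh @ x + W_hh @ h_prev)

x1, x2, x3 = np.random.randn(3, embed_size)   # embedded "Comment", "allez", "vous"

h0 = np.zeros(hidden_size)
h1 = rnn_cell(x1, h0)   # timestep 1
h2 = rnn_cell(x2, h1)   # timestep 2
h3 = rnn_cell(x3, h2)   # timestep 3
```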

- After we have processed the entire sequence, we are ready to feed the hidden states to the attention decoder. First, the attention decoder assigns a score to each hidden state generated from the input. Then, the decoder feeds these scores to a softmax function, which makes sure that all scores are positive, lie between 0 and 1, and sum to 1.
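A sketch of the scoring step, assuming a simple dot-product score between the decoder hidden state and each encoder hidden state (other scoring functions exist; all values here are illustrative):

```python
import numpy as np

np.random.seed(0)
hidden_size, seq_len = 4, 3

encoder_states = np.random.randn(seq_len, hidden_size)   # hidden states from the encoder
decoder_hidden = np.random.randn(hidden_size)             # current decoder hidden state

# Score each encoder hidden state against the decoder hidden state (dot-product scoring)
scores = encoder_states @ decoder_hidden                   # shape (seq_len,)

# Softmax: all weights positive, between 0 and 1, and summing to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights, weights.sum())
```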

- After assigning softmax scores, the decoder creates a Context Vector by multiplying each hidden state by its respective softmax score and summing the results.
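Continuing with toy numbers, the Context Vector is simply the softmax-weighted sum of the encoder hidden states:

```python
import numpy as np

encoder_states = np.array([[0.1, 0.3, -0.2, 0.5],
                           [0.4, -0.1, 0.2, 0.0],
                           [-0.3, 0.2, 0.1, 0.6]])   # 3 encoder hidden states (toy values)
weights = np.array([0.7, 0.2, 0.1])                   # softmax scores from the previous step

# Context Vector: each hidden state scaled by its weight, then summed
context = (weights[:, None] * encoder_states).sum(axis=0)   # shape (4,)
print(context)
```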
- After creating the Context Vector, the decoder generates a new hidden state and an output (the translated word).
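A rough sketch of that output step, assuming the decoder combines the Context Vector with its hidden state and projects the result onto a tiny toy vocabulary (all names and values are made up):

```python
import numpy as np

np.random.seed(0)
hidden_size, vocab = 4, ["how", "are", "you", "<eos>"]

context = np.random.randn(hidden_size)          # from the attention step
decoder_hidden = np.random.randn(hidden_size)   # decoder's new hidden state

# Project [context; hidden] onto the vocabulary to pick the translated word
W_out = np.random.randn(len(vocab), 2 * hidden_size)
logits = W_out @ np.concatenate([context, decoder_hidden])
print(vocab[int(np.argmax(logits))])            # the emitted (toy) output word
```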

- The same process is repeated until the end of the sequence is reached.

In the previous points, our model needs to generate an Attention Weight Vector in order to emphasise the relevant parts of the input sequence. The attention weight vector takes into account the hidden state of the decoder and the set of hidden states of the encoder. So, when we pass the hidden states of the encoder to the decoder (a full sketch of this loop follows the list below) -

- Decoder generates its own hidden state.
- Performs attention: a. Generate scores for the encoder's hidden states. b. Apply softmax to the scores. c. Multiply each hidden state by its softmax score and sum the results to produce the Context Vector.
- Concatenate the Context Vector with the hidden state of the decoder.
- Pass the concatenated result to a fully connected layer (multiplying by Wa).
- Pass the result through an activation function (tanh).
- Generate a new word.
- Repeat until the end of the sequence is reached.
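Putting the list above together, here is a hedged end-to-end sketch of one possible attention decoder loop in NumPy. Everything here (the dimensions, the dot-product scoring, the toy vocabulary, and the greedy stopping rule) is an assumption for illustration, not the exact model from the text.

```python
import numpy as np

np.random.seed(0)
hidden_size, embed_size = 4, 3
vocab = ["how", "are", "you", "<eos>"]

# Toy, randomly initialised parameters (learned in a real model)
W_xh = np.random.randn(hidden_size, embed_size)        # decoder RNN: input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size)       # decoder RNN: hidden -> hidden
W_a  = np.random.randn(hidden_size, 2 * hidden_size)   # fully connected layer Wa
W_out = np.random.randn(len(vocab), hidden_size)       # projection onto the vocabulary
out_embeddings = np.random.randn(len(vocab), embed_size)

encoder_states = np.random.randn(3, hidden_size)       # stand-in for the encoder's hidden states

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = encoder_states[-1]                  # initialise the decoder from the encoder's last state
x = np.zeros(embed_size)                # stand-in for a <start> token embedding

for _ in range(10):                     # cap the number of decoding steps
    # 1. the decoder generates its own hidden state
    h = np.tanh(W_xh @ x + W_hh @ h)

    # 2. attention: score the encoder states, softmax, weighted sum -> Context Vector
    scores = encoder_states @ h
    weights = softmax(scores)
    context = weights @ encoder_states

    # 3-5. concat the Context Vector with the decoder hidden state, apply Wa, then tanh
    attentional = np.tanh(W_a @ np.concatenate([context, h]))

    # 6. generate a new word
    word_id = int(np.argmax(W_out @ attentional))
    print(vocab[word_id])

    # 7. repeat until the end of the sequence is reached
    if vocab[word_id] == "<eos>":
        break
    x = out_embeddings[word_id]         # feed the emitted word back in as the next input
```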
As the name suggests, we concatenate the two hidden states (the Context Vector and the decoder's own hidden state) before passing them through the fully connected layer, as described in the steps above.

Attention is used in a wide range of tasks, such as:
- Machine translation
- Document Summarization
- Dialogue Exchange
- Image Caption Generator

