
RNN In-Depth Intuition


NOTE: all the images were taken from different sources, and image credits go to their original authors.

Why do we need RNNs?

  • In tasks like NLP, the order of the input sequence matters a lot, but a simple feed-forward NN cannot process sequential information.
  • In NLP, the size of the input is often variable. Zero padding is one (debatable) workaround, but choosing an RNN is the more natural option for variable-length sequences.
How it works:

In an RNN, we reuse the same network across time stamps. That is, the activation from the previous time stamp is passed back into the same network as input, together with the new input at the current time stamp.

There are different types of RNNs, as below:
1. Many to one
2. One to many
3. Many to many
4. Encoder-decoder type

Forward propagation:


If this is the RNN at 4 different time stamps, then the forward propagation is defined as shown in the image.
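As a rough sketch of that forward pass, here is a minimal NumPy version. The weight names Waa, Wax, Wya and the tanh/softmax choices follow Andrew Ng's notation and are assumptions on my part, not something read off the image:

```python
import numpy as np

def rnn_forward(x_seq, Waa, Wax, Wya, ba, by):
    """Unroll a simple RNN over the time stamps of one input sequence.

    x_seq : list of input column vectors x<1>, x<2>, ... (one per time stamp)
    Waa   : weights applied to the previous activation a<t-1>
    Wax   : weights applied to the current input x<t>
    Wya   : weights mapping the activation to the output y<t>
    """
    a_prev = np.zeros((Waa.shape[0], 1))   # a<0> = 0
    outputs = []
    for x_t in x_seq:                      # one iteration per time stamp
        # a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        # y<t> = softmax(Wya a<t> + by): a probability over the vocabulary
        z = Wya @ a_t + by
        e = np.exp(z - z.max())
        outputs.append(e / e.sum())
        a_prev = a_t                       # fed into the next time stamp
    return outputs
```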
 



Backpropagation:




Example Scenario:

Training of RNN for predicting the next word in the sequence:


In this scenario, the training set contains a corpus of input sentences like the one in the picture above, and each sentence is tokenized based on word embeddings. The last sentence in the image notes that if a sentence contains a token that is not in our training vocabulary, it is treated as an unknown token.
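As a small illustrative sketch of that tokenization step (the toy vocabulary below is hypothetical, not the actual training corpus):

```python
# Hypothetical toy vocabulary; a real one would be built from the whole training corpus.
vocab = {"<unk>": 0, "<eos>": 1, "cats": 2, "average": 3, "15": 4,
         "hours": 5, "of": 6, "sleep": 7, "a": 8, "day": 9}

def tokenize(sentence):
    # Words not present in the training vocabulary become the unknown token.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()] + [vocab["<eos>"]]

print(tokenize("Cats average 15 hours of sleep a day"))
print(tokenize("The mau drinks milk"))   # unseen words map to <unk>
```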

Let's see how training happens based on this sentence.



The first RNN block learns the following task:

Initially, since we don't have any previous activation or input, zeros are given as input. At this step, the model tries to find the probability of each word in our corpus being the first word.

In the second block, given the word "cats", the model now tries to predict the probability of the next word in the sentence.

Note that each unit takes the input at the present time stamp and the activation from the previous unit as a combined input. In this way, each sentence in the corpus is first tokenized and then used to train the RNN.




The training involves standard back-propagation based on the concepts discussed above.

At each block, the output is compared to the actual word to produce a loss at that position. The loss is calculated at every block, and these per-block losses are summed to produce the total error.
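A rough sketch of that total loss, assuming softmax outputs like the ones from the rnn_forward sketch above and integer indices for the actual next words:

```python
import numpy as np

def sequence_loss(y_preds, target_ids):
    """Sum the per-position losses over the whole sequence.

    y_preds    : list of softmax output vectors, one per time stamp
    target_ids : list of the actual next-word indices at each position
    """
    total_loss = 0.0
    for y_t, target in zip(y_preds, target_ids):
        # Cross-entropy at this block: -log(probability assigned to the actual word)
        total_loss += -np.log(y_t[target, 0] + 1e-12)
    return total_loss
```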



Examples of sequence data:



Issues with regular RNN:

1. It only considers the information up to the present word in the sequence; it does not consider information that comes after the current input.

Due to this issue, the model fails to identify whether the word "Teddy" in the image refers to teddy bears or Teddy Roosevelt.
 
2. Vanishing gradient problem: in reality, the sentences we process can be very long. When we feed the input to the RNN word by word, each word passes through many layers, which can shrink the influence of the initial words on the later ones and cause this issue.


Due to this issue, the model fails to identify whether to use "was" or "were" at the end of the sentence, as shown in the image.
 

GRU:

Why GRU?

As we have seen, the vanishing gradient problem affects RNNs, so we now try to tackle it using the GRU, which implements the concept of a memory cell.

Working of GRU


This image is taken from Andrew Ng's class. Now let's try to understand what's in here. As we can see, 'c' is the memory cell that we use to remember words that could come in handy later when we are processing future words.

In simple words, assume the gate Γ_u is an array that holds a value between 0 and 1 for each word in the sentence we are processing, where 1 means we are remembering that word and 0 means we are not. But please note that this value can change after every iteration.

The next question is how we store values into the memory cell. After each iteration, a candidate value computed from the previous activation and the current input is written into the memory cell array. Note that the value is only written according to the gate value: if the corresponding gate value is 0.8, the new candidate value is stored with large weight, while only a small part of the old memory is kept. In this way, at every iteration, the memory value for the corresponding word is updated based on the gate value.

The formulas for the calculation can be seen here in the image.
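As a rough sketch of the simplified GRU step described above (the weight names and the use of NumPy are my assumptions; the formulas follow the notation from Andrew Ng's class):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t, Wc, bc, Wu, bu):
    """One time stamp of the simplified GRU."""
    concat = np.vstack([c_prev, x_t])          # [c<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ concat + bc)        # candidate memory value
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate Γ_u, between 0 and 1
    # Keep the old memory where Γ_u is close to 0, overwrite it where Γ_u is close to 1.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    a_t = c_t                                  # in the GRU the activation equals the memory cell
    return c_t, a_t
```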

What we can see above is a simplified version of the GRU; a more complex version uses the formula below.

These different versions were derived by researchers through extensive experimentation. There can be many alternative versions, and there is no single solid reason why exactly this formula is chosen over the others.
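For reference, here is a sketch of one step of the fuller version, which adds a relevance gate Γ_r controlling how much of c<t-1> enters the candidate value. This is an assumption-laden sketch in the same style as the one above, not the exact formula from the image:

```python
import numpy as np

def gru_step_full(c_prev, x_t, Wc, bc, Wu, bu, Wr, br):
    """One time stamp of the full GRU, which adds a relevance gate Γ_r."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate, as before
    gamma_r = sigmoid(Wr @ concat + br)        # relevance gate: how relevant is c<t-1>?
    # The candidate is now computed from a gated version of the previous memory.
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t                                 # the activation again equals the memory cell
```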

LSTMs:

The only difference between the LSTM and the GRU is the structure of the formulas, as shown below: instead of one gate, we use multiple gates, that is, one for the present memory value (update), one for the previous memory value (forget), and one for the output.



The above image represents the multiple connections.
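A rough sketch of one LSTM step with these three gates (weight names assumed, following the same style as the GRU sketches above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wc, bc, Wu, bu, Wf, bf, Wo, bo):
    """One LSTM time stamp with separate update, forget, and output gates."""
    concat = np.vstack([a_prev, x_t])          # [a<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ concat + bc)        # candidate memory value
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate: how much of the candidate to write
    gamma_f = sigmoid(Wf @ concat + bf)        # forget gate: how much of the old memory to keep
    gamma_o = sigmoid(Wo @ concat + bo)        # output gate: how much of the memory to expose
    c_t = gamma_u * c_tilde + gamma_f * c_prev # unlike the GRU, the two gates are independent
    a_t = gamma_o * np.tanh(c_t)               # the activation is no longer equal to the memory cell
    return a_t, c_t
```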

There is one more variation of this, as below.

The above variation also suggests feeding the previous memory cell value into the present gate computations.

When to use what?

There is no hard rule for deciding which one to use and when. The most commonly used version is the LSTM, while the GRU is the more recent invention; based on your data, you need to choose one. But the major constraint of the LSTM is that it is computationally more expensive.

 BRNN:

Why BRNN?

The main reason for using bidirectional RNNs is that normal RNNs depend only on previous inputs and activations, so in scenarios like the one below they cannot be used effectively.


Here, if we consider named entity recognition, we cannot decide whether the word "Teddy" is the name of a person or a toy based only on the past information in the sentence; but if we use the future words as well, we can successfully determine that. This is where the BRNN comes in handy.


Working:


The working of bidirectional RNNs can be easily understood from this diagram. Instead of building the model only in the forward direction, we also build it in the backward direction. So, as described in the formula, at any point the output is now defined by both the forward words and the backward words, which makes the output more informative and tackles the problem we talked about earlier.
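A rough sketch of how the two directions could be combined at the output (assuming the forward and backward activations have already been computed by two separate passes; the weight names are assumptions):

```python
import numpy as np

def brnn_outputs(a_forward, a_backward, Wy, by):
    """Combine forward and backward activations into the output at each time stamp.

    a_forward  : list of activations from the left-to-right pass
    a_backward : list of activations from the right-to-left pass (same length)
    """
    outputs = []
    for a_fwd, a_bwd in zip(a_forward, a_backward):
        # y<t> = Wy [a_fwd<t>, a_bwd<t>] + by: every output sees past AND future context.
        y_t = Wy @ np.vstack([a_fwd, a_bwd]) + by
        outputs.append(y_t)
    return outputs
```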

Note that the blocks can be plain RNN, GRU, or LSTM units.

Deep RNNs:



As we can see in the image, deep RNNs can be as complex as the one above. You might think this is not very complex, but it is more complex than it looks: each cell could be an LSTM or a GRU, or even bidirectional, and each individual cell can itself be quite a large ANN. So from a complexity perspective, this is quite deep.

Now, as described in the formula, the activation for a particular cell can be derived as given. But try to imagine these are LSTM blocks: the activations a[2]<2> and a[1]<3> would then be computed from the ANN weights together with the memory cell and the forget gate.

This is one kind of deep RNN, in which each of the three horizontal lines can be seen as an individual RNN running across the time stamps while also receiving information from the layer below. In other words, instead of directly taking tanh(activation value) as the output, we take the activation and build another layer of RNN on top of it with different weights, and keep doing this for all the words we have as input.
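A rough sketch of such a stacked (deep) RNN forward pass, where layer l at time t uses a[l]<t-1> and a[l-1]<t> (the weight names and the tanh activation are assumptions):

```python
import numpy as np

def deep_rnn_forward(x_seq, Wa_list, Wx_list, b_list):
    """Stack several RNN layers; layer l at time t sees a[l]<t-1> and a[l-1]<t>.

    Wa_list / Wx_list / b_list hold one weight set per layer, bottom to top.
    """
    n_layers = len(Wa_list)
    # a_prev[l] holds a[l]<t-1> for each layer, initialised to zeros.
    a_prev = [np.zeros((W.shape[0], 1)) for W in Wa_list]
    top_activations = []
    for x_t in x_seq:
        below = x_t                            # the input to layer 1 is x<t>
        for l in range(n_layers):
            # a[l]<t> = tanh(Wa[l] a[l]<t-1> + Wx[l] a[l-1]<t> + b[l])
            a_t = np.tanh(Wa_list[l] @ a_prev[l] + Wx_list[l] @ below + b_list[l])
            a_prev[l] = a_t
            below = a_t                        # feed this layer's activation upward
        top_activations.append(below)          # the top layer feeds the output y<t>
    return top_activations
```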
 
Another version could be that, instead of horizontal (recurrent) connections, the upper layers have only a set of vertical connections, as follows:
