
Encoder and Decoder Problems (Language Translation)

 

The scenario is as follows:


The first part is called the encoder, which takes in the input sequence; the second part is called the decoder, which produces the output sequence. The units in either network can be LSTM or GRU cells. This is how a language translation model works.

Another similar example of this is image captioning.


Here, a CNN acts as the encoder and an RNN acts as the decoder.

How a regular language model differs from the sequence-to-sequence model:



In a general language model, the initial input a<0> is a vector of 0's, but in the translation model the initial input to the decoder is produced by the encoder. The major difference is that a language model simply estimates the probability that a given sentence occurs, whereas a translation model estimates a conditional probability: the probability of an output sentence given the input sentence.
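In symbols, a rough way to write the two objectives (using the same y<t> notation as the rest of these notes) is

$$P\big(y^{<1>}, \dots, y^{<T_y>}\big) \quad \text{(language model)}$$
$$P\big(y^{<1>}, \dots, y^{<T_y>} \mid x\big) \quad \text{(translation model)}$$

and the translation we output is the sentence that maximizes the conditional probability, $\hat{y} = \arg\max_{y} P(y \mid x)$.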


As we can see in the above image, there can be many possible translations of a sentence. Our main goal is to find the sentence that maximizes the conditional probability mentioned above.

When people see the equation, they might wonder why we don't simply pick the first word with the maximum probability, then the second word with the maximum probability, and so on (greedy search).


The problem with this greedy approach is that, as in the example shown above, both sentences seem correct, but "Jane is visiting" is the more natural English sentence, while the word "going" has the higher individual probability and would therefore be chosen greedily. To avoid this kind of scenario, we move to the beam search algorithm.

Beam Search:

In beam search, we maintain a parameter called the beam width B, which in our case is assumed to be 3. Let's see how it works.


In step 1, since the B value is 3, it keeps the 3 words with the highest probability P(y<1>|x). In the second step, for each of those 3 words it evaluates P(y<1>, y<2> | x) and again keeps only the 3 best two-word candidates before moving on to the third word. With a vocabulary of 10,000 words, at each step we check 3 * 10,000 possibilities.
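The joint probability evaluated in step 2 factorizes by the chain rule, which is what lets the decoder score one word at a time:

$$P\big(y^{<1>}, y^{<2>} \mid x\big) = P\big(y^{<1>} \mid x\big)\, P\big(y^{<2>} \mid x, y^{<1>}\big)$$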



As we can see in the above image, the next three most probable words are indicated by the red lines. The key observation here is that the word "September" is dropped, even though it was kept in step 1. The 3 highest-probability two-word combinations are then carried into the next round.


This process repeats until the <EOS> token appears.
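Here is a minimal Python sketch of the procedure, assuming a hypothetical function next_word_log_probs(x, prefix) that returns the decoder's log-probability for every vocabulary word given the input sentence x and the partial translation prefix (working in log space also anticipates the numerical issue discussed in the next section):

import heapq

def beam_search(x, next_word_log_probs, vocab, B=3, max_len=30, eos="<EOS>"):
    # Each beam entry is (cumulative log-probability, partial sentence).
    beams = [(0.0, [])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, prefix in beams:
            # B * |vocab| candidates are scored at every step.
            word_log_probs = next_word_log_probs(x, prefix)
            for word in vocab:
                candidates.append((log_p + word_log_probs[word], prefix + [word]))
        # Keep only the B most probable partial translations.
        beams = heapq.nlargest(B, candidates, key=lambda c: c[0])
        # Move finished sentences out of the beam.
        still_open = []
        for log_p, prefix in beams:
            (completed if prefix[-1] == eos else still_open).append((log_p, prefix))
        beams = still_open
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[0])

At every step the B surviving prefixes are each extended by all the vocabulary words, which is where the 3 * 10,000 evaluations per step come from.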

Improvements to Beam Search:

Length Normalization:



Instead of using the straightforward product formula, we take its logarithm. The log term is used because the probability of each term is a small value, and multiplying many small values over and over results in numerical underflow; taking the log turns the product into a sum of log-probabilities.

Similarly, when we process long translations we keep adding many negative values (since the log of a probability less than 1 is negative), which unfairly favors short sentences even when they are not accurate. To avoid that, we normalize by the factor 1/(Ty)^alpha. The alpha value is a hyperparameter; when it is 1, the objective becomes a plain average over the sentence length, and when it is 0, there is no normalization at all.
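Putting the two fixes together, the length-normalized objective that beam search maximizes takes the standard form

$$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\big(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>}\big)$$

where T_y is the length of the translation and alpha is the normalization hyperparameter mentioned above, usually set somewhere between 0 and 1 (e.g., 0.7).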


How to Choose B?

A larger B gives better results but requires more computation (more candidates are evaluated at each step).
A smaller B is faster but gives less accurate results.
So the B value should be chosen based on your application's requirements.

Error Analysis:

If there is a significant error in our language translation, how do we decide which component the error lies in? There are two obvious possibilities:
  • Beam Search 
  • RNN training
This can be decided in the following way.

First, we predict an output using the RNN together with beam search. For each prediction that is known to be an error, we evaluate the probability of the correct sentence y* given x and compare it with the probability of the predicted sentence ŷ. If P(y*|x) is higher than P(ŷ|x), the beam search algorithm is likely at fault, because it failed to find the higher-probability sentence. If it is the other way around, the RNN is probably not trained properly. We do this for every erroneous output. Based on that, if the majority of errors are due to beam search, we try increasing the B value; likewise, if the errors are due to the network, we have to train it better.
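A rough sketch of this bookkeeping in Python, assuming a hypothetical function sentence_log_prob(model, x, y) that returns the RNN's log P(y | x) for an arbitrary sentence y:

def attribute_errors(errors, model, sentence_log_prob):
    # errors: list of (x, y_star, y_hat) where y_star is the human reference
    # and y_hat is the (wrong) sentence produced by beam search.
    counts = {"beam_search": 0, "rnn": 0}
    for x, y_star, y_hat in errors:
        p_star = sentence_log_prob(model, x, y_star)
        p_hat = sentence_log_prob(model, x, y_hat)
        if p_star > p_hat:
            # The RNN prefers the correct sentence, yet beam search missed it.
            counts["beam_search"] += 1
        else:
            # The RNN itself assigns more probability to the wrong sentence.
            counts["rnn"] += 1
    return counts

If counts["beam_search"] dominates, a larger beam width B is worth trying; if counts["rnn"] dominates, the network, the data, or the training setup needs attention.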

BLEU Score:
The BLEU score is used to judge how good a machine-generated translation is. It works as follows:

Here, the precision score is calculated as follows:
The machine-generated translation contains 7 words in total, which gives the denominator, and the numerator is computed by checking, for each word of the machine translation, whether it appears in any of the reference sentences; if yes, 1 is added to the numerator. Since the word "the" appears in both references, the score is 7/7, which would indicate a perfect prediction, even though the translation is clearly poor.

So the BLEU score instead uses a modified precision, which is obtained as follows:

For each word, we still check whether it appears in the reference sentences, but with one difference: even if the word is repeated many times in the machine translation (MT), we look at the references and take the maximum number of times that word occurs in any single reference sentence, and that clipped count is all the credit the word receives, no matter how many times it occurs in the MT.
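A small Python sketch of this clipped-count rule (the names here are illustrative; the n parameter extends the same rule to bigrams and higher-order n-grams):

from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n=1):
    # candidate: list of words from the machine translation
    # references: list of reference translations (each a list of words)
    cand_counts = Counter(ngrams(candidate, n))
    clipped = 0
    for gram, count in cand_counts.items():
        # An n-gram gets credit for at most the maximum number of times
        # it appears in any single reference.
        max_ref = max(Counter(ngrams(ref, n))[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / max(len(ngrams(candidate, n)), 1)

On the example above, if "the" appears at most twice in any single reference, the seven-word machine translation scores 2/7 instead of 7/7.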

This process is also done with bigrams and higher-order n-grams:



Finally, the BLEU score is calculated as follows:



This is an exponential average of the n-gram precision results. The "BP" (brevity penalty) term is used because it is very easy for short sentences to achieve high precision, since they contain only a few words to match against the references; the BP penalty compensates for that.
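Written out for the usual choice of up to 4-grams, the combined score is

$$\text{BLEU} = BP \cdot \exp\!\left(\frac{1}{4} \sum_{n=1}^{4} \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where p_n is the modified n-gram precision, c is the length of the machine translation, and r is the reference length, so BP only penalizes translations that are shorter than the reference.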

Attention Models:



As the length of the sentence increases, it becomes harder to translate, because the entire input has to pass through all the units of the encoder and be summarized before it is handed to the decoder.

Working:



As we can see in the image, the lower network is a bidirectional RNN encoder and the upper one is the decoder network. The main difference here is that we add extra weights called attention weights (the alpha terms). Each weight measures how much one word of the input contributes to a given node in the decoder network. This indirectly tells the decoder how much attention to pay to each input word when producing a particular output word, and in this way we avoid the long-sentence memory problem.

Now let's see the formulation:





All the required formulations can be seen in the above images. The key points to note are that the attention weights feeding each decoder step sum to 1, which is achieved by using a softmax, and that the values inside the softmax are learned by a small neural network. The attention weights depend on two factors: the previous hidden state of the decoder cell and the activations of the corresponding encoder positions.
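For reference, the two relations described above can be written as

$$\alpha^{<t,t'>} = \frac{\exp\big(e^{<t,t'>}\big)}{\sum_{t''=1}^{T_x} \exp\big(e^{<t,t''>}\big)}, \qquad c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>}\, a^{<t'>}$$

where a^{<t'>} is the encoder activation at input position t', e^{<t,t'>} is the score produced by the small neural network from the previous decoder state and a^{<t'>}, and c^{<t>} is the context fed into decoder step t. The softmax in the first equation is exactly what makes the weights for each decoder step sum to 1.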

Speech recognition:

The same attention mechanism can be used for speech recognition as well, but here the input is the sound frequencies over time and the output is text. When you look at this closely, a certain problem arises, which can be addressed as described below.

This is not the best technique for solving the problem, but it is a standard part of speech recognition practice.


Since we sample the sound frequencies at every small time interval, there can be a very large number of input audio frames but a much smaller number of output characters. So if we use an RNN like the one above, it can emit an output like the one shown, where repeated characters that are not separated by a blank are collapsed into one and the blanks are then removed. This produces a relatively good solution.
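A tiny Python sketch of the collapsing rule just described, assuming the network emits one character (or a blank symbol "_") per audio frame:

def collapse_output(frame_chars, blank="_"):
    # frame_chars: one output character per input audio frame,
    # e.g. "ttt_h_eee___ ___qqq" -> "the q"
    out = []
    prev = None
    for ch in frame_chars:
        # Keep a character only when it differs from the previous frame
        # and is not the blank symbol, then drop the blanks entirely.
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(collapse_output("ttt_h_eee___ ___qqq"))  # -> "the q"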
