The scenario is as follows:
The first part is called the encoder, which reads in the input sequence; the second part is called the decoder, which generates the output sequence. The recurrent units in both can be LSTM or GRU cells. This is how the language translation model works.
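A minimal sketch of such an encoder-decoder network, assuming PyTorch is available; the vocabulary sizes, dimensions, and module names here are illustrative rather than the exact model from the figure:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder reads the source sentence,
    the decoder generates the target sentence one token at a time."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)  # could also be nn.LSTM
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))            # h summarizes the input sentence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)   # decoder starts from the encoder state
        return self.out(dec_out)                              # per-step scores over the target vocabulary

# Toy usage: one source sentence of 5 tokens, one target prefix of 4 tokens
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (1, 5)), torch.randint(0, 1200, (1, 4)))
print(logits.shape)  # torch.Size([1, 4, 1200])
```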
Another similar example of this is image captioning.
Here, a convolutional network (CNN) acts as the encoder and an RNN acts as the decoder.
How a regular language model differs from the sequence model: the translation model can be viewed as a conditional language model. The decoder looks just like a language model, except that instead of starting from a vector of zeros it starts from the encoder's representation of the input sentence.
As we can see in the above image, there are many possible translations for a given sentence. Our main goal is to find the sentence y that maximizes the conditional probability P(y<1>, ..., y<Ty> | x), as mentioned above.
When people see this equation, they might wonder: why not simply pick the first word with maximum probability, then the second word with maximum probability, and so on?
But the problem with this greedy approach is that, as in the example shown above, both sentences seem correct; in reality "Jane is visiting ..." is the better English sentence, yet the word "going" gets chosen because it has the higher probability at that single step. To avoid these kinds of scenarios we move to the beam search algorithm.
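A toy numeric illustration of this failure mode; the probabilities below are hypothetical, chosen only to show how per-word greedy choice can disagree with whole-sentence probability:

```python
# Hypothetical conditional probabilities after the prefix "Jane is":
# greedy decoding compares only the next word, while the conditional
# language model's objective compares whole sentences.
p_next = {"going": 0.45, "visiting": 0.40}            # P(word | "Jane is", x)
p_rest = {"going": 0.10, "visiting": 0.30}            # P(remaining words | prefix, x)

p_sentence = {w: p_next[w] * p_rest[w] for w in p_next}
greedy_pick = max(p_next, key=p_next.get)             # picks "going"
best_sentence = max(p_sentence, key=p_sentence.get)   # picks "visiting"
print(greedy_pick, best_sentence, p_sentence)
```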
Beam Search:
In beam search, we maintain a parameter called the beam width B, which in our case is assumed to be 3. Let's see how it works.
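A minimal sketch of the procedure in Python, assuming a helper next_word_probs(prefix) that stands in for the decoder RNN's next-word distribution:

```python
import math

def beam_search(next_word_probs, B=3, max_len=10, eos="<eos>"):
    """Keep the B most likely partial translations at every step.
    `next_word_probs(prefix)` is assumed to return {word: P(word | prefix, x)}
    from the decoder network."""
    beams = [([], 0.0)]                      # (words so far, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == eos:   # finished hypotheses are kept as-is
                candidates.append((words, logp))
                continue
            for w, p in next_word_probs(words).items():
                candidates.append((words + [w], logp + math.log(p)))
        # keep only the B best partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

# Toy usage with a hard-coded distribution standing in for the RNN decoder:
toy = lambda prefix: {"jane": 0.5, "is": 0.3, "<eos>": 0.2}
print(beam_search(toy, B=3, max_len=3))
```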
Improvements to Beam Search:
Length Normalization:
Instead of using the straight product-of-probabilities formula, we add a log term. The logarithm is used because the probability of each word is a small value, and multiplying many small values together results in a number too small to represent reliably. To avoid that, we sum log-probabilities instead of multiplying probabilities.
Similarly, when we score long translations we keep adding many negative terms (because the log of a small probability is negative), which unfairly favors short sentences even when they are not accurate. To avoid that, we normalize by the factor 1/(Ty)^alpha. The alpha value is a hyperparameter; when it is 1, the score becomes the average log-probability per word.
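A small sketch of the resulting length-normalized objective; the word probabilities and alpha below are illustrative inputs:

```python
import math

def normalized_score(word_probs, alpha=1.0):
    """Length-normalized log-likelihood of one candidate translation.
    `word_probs` holds P(y_t | x, y_1..y_{t-1}) for each output word.
    Summing logs avoids underflow from multiplying many small probabilities;
    dividing by Ty**alpha removes the bias toward short sentences."""
    Ty = len(word_probs)
    log_likelihood = sum(math.log(p) for p in word_probs)
    return log_likelihood / (Ty ** alpha)

# With alpha = 1 the score is the average log-probability per word:
print(normalized_score([0.2, 0.1, 0.3], alpha=1.0))
```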
How to Choose B?
A larger B gives better results but more computation (more candidate sequences to keep track of).
A smaller B is faster but gives less accurate results.
So the value of B should be chosen based on your application's requirements.
Error Analysis:
If there is a significant error in our language translation, how do we decide which component the error lies in? There are obviously two possibilities:
- Beam Search
- RNN training
This can be decided in the following way.
First, we predict the output using the RNN together with beam search. Once an output is produced and identified as an error, we evaluate the probability of the correct (human) sentence y* given x and compare it with the probability of the predicted sentence ŷ given x. If P(y* | x) is higher than P(ŷ | x), the RNN actually preferred the right sentence, so the beam search algorithm is likely at fault. If it is the other way around, the RNN has not been trained properly. We do this for every erroneous output. If the majority of errors are due to beam search, we try increasing the B value; if the errors are due to the network, we have to work on training the network better.
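A small sketch of this bookkeeping, assuming a helper rnn_log_prob(x, y) that returns log P(y | x) under the trained network:

```python
def attribute_errors(examples, rnn_log_prob):
    """For each erroneous example, compare the model's probability of the
    human reference y_star against the beam-search output y_hat.
    `rnn_log_prob(x, y)` is an assumed helper returning log P(y | x)."""
    counts = {"beam_search": 0, "rnn": 0}
    for x, y_star, y_hat in examples:
        if rnn_log_prob(x, y_star) > rnn_log_prob(x, y_hat):
            counts["beam_search"] += 1   # model preferred y*, but the search missed it
        else:
            counts["rnn"] += 1           # model itself prefers the wrong sentence
    return counts

# Toy usage with a made-up scorer (pretend shorter sentences are more probable):
toy_score = lambda x, y: -len(y)
examples = [("x1", ["Jane", "visits"], ["Jane", "is", "going"])]
print(attribute_errors(examples, toy_score))  # {'beam_search': 1, 'rnn': 0}
```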
BLEU Score:
The BLEU score is used to judge how good a machine-generated translation is. It works as follows:
As we can see, the machine-generated translation in this example has 7 words in total, which gives the denominator at the bottom; the value at the top is calculated as follows:
For each word in the machine translation, we check whether it is present in any of the reference sentences; if yes, 1 is added to the top value. Since the word "the" is present in both references, every word counts and the score comes out to 7/7, which would indicate a perfect prediction. That is clearly not true.
So the BLEU score uses modified precision instead, which is obtained as follows:
For each word, we still check whether it appears in the references, but even if the word is repeated many times in the machine translation, it is credited at most the maximum number of times it occurs in any single reference sentence, no matter how many times it occurs in the MT. This is often called clipping the count.
This process is also repeated with bigrams and higher-order n-grams:
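A short sketch of this modified (clipped) n-gram precision; the candidate and reference sentences below are illustrative stand-ins for the "the the the ..." example described above:

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision used inside the BLEU score: each candidate
    n-gram is credited at most as many times as it appears in any single reference."""
    ngrams = lambda words: Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

# Plain precision would give 7/7 here; modified precision gives 2/7,
# because "the" appears at most twice in any one reference.
cand = ["the"] * 7
refs = [["the", "cat", "is", "on", "the", "mat"],
        ["there", "is", "a", "cat", "on", "the", "mat"]]
print(modified_precision(cand, refs, n=1))  # 0.2857...
```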
Attention Model:
As the length of the sentence increases, translation becomes harder, because the whole input has to pass through every unit of the encoder and be compressed into a single representation before it goes through the decoder.
Working:
As we can see in the image, the lower network is a bidirectional RNN encoder and the upper one is the decoder network. The main difference here is that we add additional weights called attention weights (the alpha terms). Each weight corresponds to one word in the input, and the weighted activations feed into every node of the decoder network. This indirectly tells the decoder how much each input word contributes to a certain output. In this way, we avoid the long-sentence memory problem.
Now let's see the formulation:
All the formulations required can be seen in the above images. The key points to note are: the attention weights for each decoder step sum to 1, which is achieved using a softmax operation, and the values inside the softmax (the energies) are learned using a small neural network. The attention weights depend on two factors: the previous hidden state of the decoder cell and the activations from the corresponding encoder positions.
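A NumPy sketch of how the attention weights and context vector for one decoder step could be computed; the scoring network here (a single tanh layer with parameters W and v) is a simplified stand-in for the small network mentioned above:

```python
import numpy as np

def attention_context(s_prev, a_enc, W, v):
    """One decoder step of attention.
    s_prev : previous decoder hidden state, shape (n_s,)
    a_enc  : encoder activations for every input position, shape (Tx, n_a)
    W, v   : parameters of the small scoring network (assumed already learned)
    Returns (context, alphas); the alphas are non-negative and sum to 1."""
    Tx = a_enc.shape[0]
    # energies e<t,t'> from a one-layer network over [s_prev, a<t'>]
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_enc[t]])) for t in range(Tx)])
    alphas = np.exp(e - e.max())
    alphas = alphas / alphas.sum()      # softmax ensures the weights sum to 1
    context = alphas @ a_enc            # weighted sum of encoder activations
    return context, alphas

# Toy usage with random parameters:
rng = np.random.default_rng(0)
n_s, n_a, Tx = 4, 6, 5
ctx, al = attention_context(rng.normal(size=n_s), rng.normal(size=(Tx, n_a)),
                            rng.normal(size=(8, n_s + n_a)), rng.normal(size=8))
print(al.sum())  # 1.0
```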
Speech recognition:
The same attention mechanism can be used for speech recognition as well, but here the input is the audio (frequency) signal and the output is text. When you look at this closely, there is a certain problem, described below, which can be addressed as follows.
It is not the best technique on its own, but it is an important part of how speech recognition output is produced and evaluated.
Since we sample the sound frequencies at every small time interval, there can be a very large number of input audio frames but a comparatively small number of output characters or words. So if we use an RNN like the one above, it can generate an output like the one shown, where repeated letters between the blanks are combined into a single character. This produces a relatively good solution.
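A small sketch of that collapsing rule (this is the rule used by the CTC, connectionist temporal classification, approach; the frame sequence below is made up for illustration):

```python
def ctc_collapse(frame_outputs, blank="_"):
    """Collapse a per-frame character sequence the CTC way:
    merge runs of repeated characters, then drop the blank symbols."""
    collapsed = []
    prev = None
    for ch in frame_outputs:
        if ch != prev:           # keep only the first of a run of repeats
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != blank)

# 16 audio frames collapse to a 5-character word:
print(ctc_collapse(list("hh_eee_ll_lll_oo")))  # "hello"
```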