The main idea of the transformer is to combine the attention model concept with the parallel-computation concept of CNNs (in a CNN, the data is fed to the network in parallel rather than one step at a time).
Self-Attention:
The self-attention model of the transformer works similarly to the self-attention models we discussed in the previous blog. The key idea here is that, for each word, it uses the neighboring words to derive three values, q, k, and v, which help us obtain the attention value for the word we are looking at.
The formulation can be seen in the above image, but let's build a more in-depth understanding with the slide below.
Let's first try to understand what the q, k, and v values are. In simple terms, q is the question we ask about the word we are looking at, for example, "What's happening in Africa?". The key (k) looks for the surrounding words that match the word we are looking at; for example, Africa and visit have the highest inner product, as highlighted in the image. Finally, the value (v) represents how the word visit should contribute to the representation of the word Africa based on its surrounding words. In this way, instead of one fixed word embedding, we represent each word based on its surrounding words, using a separate self-attention computation for each word.
The formulation for these values can be seen above in the images. First, q, k, and v are derived by multiplying the input X with trained weight matrices. Then the q and k values are combined using the inner product, and a softmax is applied so that the resulting weights sum to 1. These weights are then used to take a weighted sum of the v values, giving the self-attention activation for that word. The same attention representation is computed for each word in parallel.
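To make this concrete, here is a minimal NumPy sketch of this computation for one head over a whole sentence at once; the names X, Wq, Wk, Wv and the toy dimensions are illustrative assumptions, not values from the slides.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (n_words, d_model) input embeddings (with positional encoding added).
    Wq, Wk, Wv: (d_model, d_k) learned weight matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # inner products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of the values

# Toy example: 3 words, d_model = 4, d_k = 2 (random weights stand in for trained ones)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (3, 2): one attention vector per word
```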
Multi-Head Attention:
In general, one set of attention weights calculated for each word in the sentence is called a head. We now calculate n different heads in parallel. For each word, we concatenate these n head activations and multiply them by another set of weights to obtain the final attention value at the top.
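A minimal sketch of this concatenation step, continuing from the previous snippet (it reuses self_attention, rng, and X); the number of heads and the output projection Wo are illustrative assumptions.

```python
def multi_head_attention(X, head_weights, Wo):
    """head_weights: list of (Wq, Wk, Wv) tuples, one per head.
    Wo: (n_heads * d_k, d_model) output projection that mixes the heads."""
    heads = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in head_weights]
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate heads, then project

# Two heads with d_k = 2 each, projected back to d_model = 4
head_weights = [tuple(rng.normal(size=(4, 2)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(2 * 2, 4))
print(multi_head_attention(X, head_weights, Wo).shape)   # (3, 4)
```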
Combining all the details together:
As we can see in the image, there is an encoder block that takes the input sentence and calculates self-attention values. The multi-headed self-attention values are then passed to a feed-forward neural network, and this process is repeated N times. Note that transformers also give importance to positional encoding, that is, the position at which each input word occurs; this is combined with the input before being passed through the encoder. We also apply an Add & Norm step, similar in spirit to batch normalization, to speed up training.
Now, coming to the decoder side, there are two multi-head attention blocks. The first block takes the previously generated output as input and produces the q values for the next block. The second multi-head attention block, instead of taking its (k, v) values from the previous block, takes them from the encoder output, which also helps avoid the long-term memory problem. The attention values are then passed to a feed-forward neural network, which predicts the next possible word through a softmax activation, as described in the image. This decoder block is also repeated N times for each word in the sequence before moving on to the next word.
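Here is a rough sketch of that second, encoder-decoder attention block, reusing the softmax helper from the first snippet; the argument names are hypothetical.

```python
def cross_attention(decoder_X, encoder_out, Wq, Wk, Wv):
    """Queries come from the decoder side; keys and values come from the encoder output."""
    Q = decoder_X @ Wq                           # q from the decoder's first attention block
    K, V = encoder_out @ Wk, encoder_out @ Wv    # k, v taken from the encoder output
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V
```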
The iteration in the decoder happens in this way: (<sos>), (<sos> jane), (<sos> jane visit), ...
Note that positional encoding is done on both the encoder and the decoder side, and the values are calculated by the formula below, where d is the number of values (dimensions) used to represent the word in its embedding.
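As a hedged sketch of that formula (the sinusoidal encoding from the original paper, sine for even dimensions and cosine for odd ones); the sequence length and d used below are arbitrary, and the last line continues from the first snippet's X.

```python
def positional_encoding(n_positions, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(0, d, 2)[None, :]              # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cosine
    return pe

X_with_position = X + positional_encoding(3, 4)  # added to the word embeddings
```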
Note that during training, since we know both the input and the output, the first block in the decoder acts as a masked multi-head attention model; that is, it mimics the testing phase during training as well. Here is an example of how this works: given the correct translation "Jane visits Africa in", it hides the remaining words and checks whether the algorithm can predict the next word or not.
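A minimal sketch of this masking, reusing the softmax helper and rng from the first snippet: scores for positions to the right of the current word are set to a large negative number before the softmax, so they receive (almost) zero attention weight.

```python
def causal_mask(scores):
    """Hide future words: scores is the (n_words, n_words) q.k score matrix."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future positions
    return np.where(mask, -1e9, scores)               # softmax of -1e9 is effectively 0

masked = softmax(causal_mask(rng.normal(size=(4, 4))), axis=-1)
print(np.round(masked, 2))  # upper triangle ~0: each word attends only to itself and earlier words
```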
This is not yet a completely clear picture; we will revisit it soon.
IN TERMS OF NUMBERS LET'S SEE HOW IT WORKS:
Encoder:
Suppose you have the following word embeddings. We will then positionally encode them as follows.
The lines connecting the encodings to the query and key values represent weights that we will learn during training.
The next step is to perform the q · k operation for the similarity analysis of each word.
From this operation we obtain two new values, which represent how the word is related to itself and to the word "go." In this way, if we have n words in our sentence, we get n × n values in total, where each word has one similarity score with itself and n − 1 scores with the other words.
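As a tiny standalone illustration with made-up numbers (not the values from the images), two words give a 2 × 2 score matrix:

```python
import numpy as np

Q = np.array([[1.0, 0.5],    # query for word 1
              [0.2, 1.0]])   # query for word 2 ("go")
K = np.array([[0.8, 0.1],
              [0.3, 0.9]])
scores = Q @ K.T             # 2 x 2: each word scored against itself and the other word
print(scores)
```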
Note that the k,q, and v weights will be the same across all the words for a single head.
The next step is to add a residual connection.
This connection ensures that the information from the original encodings is not lost through the long chain of computations, and it speeds up training by letting the self-attention values build on top of the original inputs rather than replace them.
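A one-line sketch of this Add step, continuing from the earlier snippets (it reuses X_with_position, multi_head_attention, head_weights, and Wo):

```python
# Add the block's input back onto its output so the original encoding is preserved
residual_out = X_with_position + multi_head_attention(X_with_position, head_weights, Wo)
```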
Decoder:
Same word embeddings:
We can start with either <EOS> or <SOS>. In this example, let's consider <EOS>.
Now that we have created the embeddings, we will also create self-attention values for the first word in the decoder.
Now we will again create a query from these self-attention values and then take keys from each word in the encoder.
Note that the weights used here are different from the encoder.
Similar to the encoder let's add residual connections.
Now that we have these attention values for <EOS>, we pass them into the neural network to predict the next word, as below.
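A minimal sketch of that last step, assuming a made-up four-word vocabulary and random stand-ins for the trained output weights (reusing rng and softmax from the first snippet):

```python
decoder_out = rng.normal(size=(4,))          # stand-in for the <EOS> attention + residual output
vocab = ["ir", "vamos", "y", "<EOS>"]        # hypothetical output vocabulary
W_out = rng.normal(size=(4, len(vocab)))     # (d_model, vocab_size); learned during training
probs = softmax(decoder_out @ W_out)         # softmax over the vocabulary
print(vocab[int(np.argmax(probs))])          # most likely next word
```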
In the next stage, we take the word "vamos" as input to the decoder, and the process is repeated.
We can add a neural network in the encoder as well. We can also add normalization steps as explained in the above theory.
We can also use different similarity measures instead of a plain dot product. In the original research paper, they used the following:
In this way, we implement multi-head self-attention and concatenate the head outputs using a weight matrix.