Jay Alammar: The Illustrated Transformer (blog post notes)

  • 6 stacked encoders + 6 stacked decoders
  • Each encoder has the same structure, but they do not share weights (see the sketch below)
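
A minimal PyTorch sketch (my own, not from the blog post) of building a stack of identically structured encoder layers that do not share weights; `make_encoder_stack` is a name I made up, and `nn.TransformerEncoderLayer` stands in for the post's self-attention + feed-forward encoder layer.

```python
# Sketch: N encoder layers with the same structure but independent weights.
import copy
import torch.nn as nn

def make_encoder_stack(layer: nn.Module, n_layers: int = 6) -> nn.ModuleList:
    """deepcopy gives each clone its own parameters, so no weight sharing."""
    return nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])

# d_model=512 and 8 attention heads, as in the base Transformer
encoder_stack = make_encoder_stack(nn.TransformerEncoderLayer(d_model=512, nhead=8), 6)
```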

Steps

Given a sentence S = {x_1, x_2, ..., x_n}

  1. Calculate a vector embedding for each token x_i
  2. Sum it with the positional embedding: e_i = Embedding(x_i) + PE(i) (see the first sketch after this list)
  3. Calculate query, key, and value vectors for each position: q_i = e_i W^Q, k_i = e_i W^K, v_i = e_i W^V
  4. Calculate the score of each position with scaled dot-product attention (see the attention sketch below)
    1. Z = softmax(Q·Kᵀ / √d_k)·V
  5. Assume all of the above is applied in a multi-head attention manner with 8 heads. We obtain 8 different Z matrices, concatenate them, and multiply by an additional weight matrix W^O (see the multi-head sketch below).
  6. Transform the last encoder's output into K and V matrices and feed them into the encoder-decoder attention layer of each decoder.
  7. Using these K and V, the first decoding step produces an output token. After the first step, each step also takes the previously produced outputs as input. Decoding ends when the EOS token is produced (see the decoding sketch below).
  8. The last decoder layer outputs a vector of floats of size 1x512. A final linear layer projects it to a logits vector the size of the vocabulary, and softmax turns the logits into probabilities; the cell with the highest probability is chosen as the output word.
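
Sketch for steps 1-2 (my own PyTorch code, not the blog's): token embeddings summed with sinusoidal positional encodings. The helper name `positional_encoding`, the toy vocabulary size, and the example token ids are made up.

```python
# Steps 1-2: embedding lookup + sinusoidal positional encoding, then element-wise sum.
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                  # sine on even dims
    pe[:, 1::2] = torch.cos(angle)                                  # cosine on odd dims
    return pe

embed = torch.nn.Embedding(num_embeddings=10000, embedding_dim=512)  # toy vocab
tokens = torch.tensor([5, 42, 7])                                    # a 3-token sentence
x = embed(tokens) + positional_encoding(len(tokens), 512)            # step 2: the sum
```

Sketch for steps 3-4: each position gets q, k, v through learned projections, and the output is Z = softmax(Q·Kᵀ / √d_k)·V. The weight matrices here are random stand-ins rather than trained parameters.

```python
# Steps 3-4: per-position query/key/value vectors and scaled dot-product attention.
import math
import torch

d_model, d_k = 512, 64
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))   # stand-in projections

x = torch.randn(3, d_model)             # 3 token embeddings from steps 1-2
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # step 3: one q, k, v row per position

scores = Q @ K.T / math.sqrt(d_k)       # step 4: scaled dot-product scores
Z = torch.softmax(scores, dim=-1) @ V   # attention output Z, one row per position
```

Sketch for step 5: run 8 independent heads, concatenate the 8 Z matrices, and project back to d_model with W^O. Again, all weights are random stand-ins.

```python
# Step 5: multi-head attention as concat(Z_1..Z_8) multiplied by W_O.
import math
import torch

d_model, n_heads = 512, 8
d_k = d_model // n_heads                 # 64 dimensions per head
x = torch.randn(3, d_model)              # 3 positions

heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1) @ V)

W_o = torch.randn(n_heads * d_k, d_model)
Z = torch.cat(heads, dim=-1) @ W_o       # concat -> (3, 512), then project with W_O
```

Sketch for steps 6-8: greedy decoding with PyTorch's built-in decoder. The encoder output ("memory") supplies K and V for every decoder layer, each step feeds all previously generated tokens back in, the final linear layer + softmax produce vocabulary probabilities, and decoding stops at EOS. The vocabulary size, BOS/EOS ids, and step cap are made-up stand-ins.

```python
# Steps 6-8: decode greedily using encoder memory until EOS is produced.
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id = 512, 30000, 1, 2
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)            # final linear layer (step 8)

memory = torch.randn(5, 1, d_model)                   # last encoder's output (K, V source)
generated = [bos_id]
for _ in range(20):                                   # cap the loop for the sketch
    tgt = embed(torch.tensor(generated)).unsqueeze(1) # previous outputs as decoder input
    out = decoder(tgt, memory)                        # (len, batch=1, d_model)
    probs = torch.softmax(to_logits(out[-1, 0]), dim=-1)
    next_id = int(probs.argmax())                     # cell with the highest probability
    generated.append(next_id)
    if next_id == eos_id:                             # EOS ends the decoding phase
        break
```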