Jay Alammar: The Illustrated Transformer (blog post notes)

  • 6 stacked encoders + 6 stacked decoders
  • Each encoder has the same structure, but they do not share weights (see the sketch below)
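
A minimal PyTorch sketch (my own, not from the blog post) of building a stack of identically structured encoder layers that do not share weights; `make_encoder_stack` is a name I made up, and `nn.TransformerEncoderLayer` stands in for the post's self-attention + feed-forward encoder layer.

```python
# Sketch: N encoder layers with the same structure but independent weights.
import copy
import torch.nn as nn

def make_encoder_stack(layer: nn.Module, n_layers: int = 6) -> nn.ModuleList:
    """deepcopy gives each clone its own parameters, so no weight sharing."""
    return nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])

# d_model=512 and 8 attention heads, as in the base Transformer
encoder_stack = make_encoder_stack(nn.TransformerEncoderLayer(d_model=512, nhead=8), 6)
```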

Steps

Given a sentence S = {x_1, x_2, ..., x_n}

  1. Calculate a vector embedding for each token x_i
  2. Sum it with the positional embedding: e_i = Embedding(x_i) + PE(i) (see the first sketch after this list)
  3. Calculate query, key, and value vectors for each position: q_i = e_i W^Q, k_i = e_i W^K, v_i = e_i W^V
  4. Calculate the score of each position with scaled dot-product attention (see the attention sketch below)
    1. Z = softmax(Q·Kᵀ / √d_k)·V
  5. Assume all of the above is applied in a multi-head attention manner with 8 heads. We obtain 8 different Z matrices, concatenate them, and multiply by an additional weight matrix W^O (see the multi-head sketch below).
  6. Transform the last encoder's output into K and V matrices and feed them into the encoder-decoder attention layer of each decoder.
  7. Using these K and V, the first decoding step produces an output token. After the first step, each step also takes the previously produced outputs as input. Decoding ends when the EOS token is produced (see the decoding sketch below).
  8. The last decoder layer outputs a vector of floats of size 1x512. A final linear layer projects it to a logits vector the size of the vocabulary, and softmax turns the logits into probabilities; the cell with the highest probability is chosen as the output word.
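
Sketch for steps 1-2 (my own PyTorch code, not the blog's): token embeddings summed with sinusoidal positional encodings. The helper name `positional_encoding`, the toy vocabulary size, and the example token ids are made up.

```python
# Steps 1-2: embedding lookup + sinusoidal positional encoding, then element-wise sum.
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                  # sine on even dims
    pe[:, 1::2] = torch.cos(angle)                                  # cosine on odd dims
    return pe

embed = torch.nn.Embedding(num_embeddings=10000, embedding_dim=512)  # toy vocab
tokens = torch.tensor([5, 42, 7])                                    # a 3-token sentence
x = embed(tokens) + positional_encoding(len(tokens), 512)            # step 2: the sum
```

Sketch for steps 3-4: each position gets q, k, v through learned projections, and the output is Z = softmax(Q·Kᵀ / √d_k)·V. The weight matrices here are random stand-ins rather than trained parameters.

```python
# Steps 3-4: per-position query/key/value vectors and scaled dot-product attention.
import math
import torch

d_model, d_k = 512, 64
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))   # stand-in projections

x = torch.randn(3, d_model)             # 3 token embeddings from steps 1-2
Q, K, V = x @ W_q, x @ W_k, x @ W_v     # step 3: one q, k, v row per position

scores = Q @ K.T / math.sqrt(d_k)       # step 4: scaled dot-product scores
Z = torch.softmax(scores, dim=-1) @ V   # attention output Z, one row per position
```

Sketch for step 5: run 8 independent heads, concatenate the 8 Z matrices, and project back to d_model with W^O. Again, all weights are random stand-ins.

```python
# Step 5: multi-head attention as concat(Z_1..Z_8) multiplied by W_O.
import math
import torch

d_model, n_heads = 512, 8
d_k = d_model // n_heads                 # 64 dimensions per head
x = torch.randn(3, d_model)              # 3 positions

heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1) @ V)

W_o = torch.randn(n_heads * d_k, d_model)
Z = torch.cat(heads, dim=-1) @ W_o       # concat -> (3, 512), then project with W_O
```

Sketch for steps 6-8: greedy decoding with PyTorch's built-in decoder. The encoder output ("memory") supplies K and V for every decoder layer, each step feeds all previously generated tokens back in, the final linear layer + softmax produce vocabulary probabilities, and decoding stops at EOS. The vocabulary size, BOS/EOS ids, and step cap are made-up stand-ins.

```python
# Steps 6-8: decode greedily using encoder memory until EOS is produced.
import torch
import torch.nn as nn

d_model, vocab_size, bos_id, eos_id = 512, 30000, 1, 2
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)            # final linear layer (step 8)

memory = torch.randn(5, 1, d_model)                   # last encoder's output (K, V source)
generated = [bos_id]
for _ in range(20):                                   # cap the loop for the sketch
    tgt = embed(torch.tensor(generated)).unsqueeze(1) # previous outputs as decoder input
    out = decoder(tgt, memory)                        # (len, batch=1, d_model)
    probs = torch.softmax(to_logits(out[-1, 0]), dim=-1)
    next_id = int(probs.argmax())                     # cell with the highest probability
    generated.append(next_id)
    if next_id == eos_id:                             # EOS ends the decoding phase
        break
```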