References and useful resources

What is a Transformer?

Transformers are a type of machine learning model designed for processing sequences. They were introduced in the paper "Attention Is All You Need". The original paper focuses on NLP tasks, but the architecture can be applied to other types of data as well.

Motivations

Model architecture

Unlike RNN models, Transformers don't process inputs sequentially.

transformers_01.png

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Encoder: The encoder is composed of a stack of N = 6 identical layers.

Decoder: The decoder is also composed of a stack of N = 6 identical layers.

Ancestors of the Transformers:

⚠️ TODO: complete and explain

Word Embedding

word_embedding.png
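In code terms, a word embedding is just a lookup into a learned matrix with one row per vocabulary token. A minimal NumPy sketch (the vocabulary, embedding size and random values below are made-up stand-ins for the learned ones):

    import numpy as np

    vocab = {"Let's": 0, "go": 1}   # toy vocabulary (assumption)
    d_model = 2                     # toy embedding size (assumption)

    # In a real model this matrix is learned during training.
    embedding_table = np.random.randn(len(vocab), d_model)

    tokens = ["Let's", "go"]
    word_embeddings = embedding_table[[vocab[t] for t in tokens]]  # shape (2, d_model)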

Positional encoding

To compute the positional encoding, we use cos and sin graphs with various wavelengths. The number of positional encoding graphs matches the embedding dimensionality. Finally, the positional encoding is added to the word embedding.

⚠️ We use the same cos and sin graphs for the encoding and decoding parts of the Transformer.
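A minimal NumPy sketch of this sinusoidal encoding (the formula is the one from "Attention Is All You Need"; it assumes an even d_model):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Even dimensions use sin, odd dimensions use cos, with wavelengths
        # growing geometrically from 2π to 10000·2π across the dimensions.
        positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[np.newaxis, :]   # (1, d_model/2)
        angles = positions / np.power(10000, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # The positional encoding is simply added to the word embeddings:
    # x = word_embeddings + positional_encoding(seq_len, d_model)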

positional_encoding.png

positional_encoding_02.png

Self attention

It works by seeing how similar each word is to all of the words in the sentence (including itself).

self_attention_01.png
self_attention_02.png
self_attention_03.png

And indeed, this is done for all the tokens of the sentence.

Compute similarity scores

(for all tokens)

Words that are frequently associated and close to each other in the text corpus will get higher similarities (e.g. "it" and "pizza" or "it" and "oven" will have high scores, whereas "The" and "came" will not). These associations are what the Query and Key values, computed next, are trained to capture.

Compute Query and Keys

(for the token "Let's")

  1. we compute the Query value for "Let's"
  2. we compute the Key value for "Let's"
  3. we compute the Key value(s) for the other tokens of the sentence.

⚠️ We use only one set of weights to compute all Keys and only one set of weights to compute all Queries (in fact, one for the Encoder Self-Attention, one for the Decoder Self-Attention and one for the Encoder-Decoder Attention), but as we apply them to different position-encoded embeddings, the results are different.
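In code, steps 1-3 are just two matrix multiplications that share the same weights across all positions. A sketch in NumPy (the input values are made up, and the weight matrices are random stand-ins for the learned ones):

    import numpy as np

    d_model = 2                              # toy size matching the 2-D example
    x = np.array([[0.9, 0.4],                # position-encoded "Let's" (made up)
                  [0.5, 0.7]])               # position-encoded "go" (made up)

    W_Q = np.random.randn(d_model, d_model)  # the one set of Query weights
    W_K = np.random.randn(d_model, d_model)  # the one set of Key weights

    Q = x @ W_Q   # one Query per token, same weights for every position
    K = x @ W_K   # one Key per token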

self_attention_04.png

At the bottom we can see the input, then the embedding, then the positional encoding (with the cos and sin symbols), then the result of adding the word embedding and the positional encoding... finally, on top of all of these, we can spot the Query and Key values used to compute the similarity scores.

Compute the Dot Product similarity between the Query and the Keys

(for the token "Let's")

This is simply done by applying any similarity function between the Query and each Key.

The original paper uses Sim = DotProduct / √(embedding size), but in this example we use the plain Dot Product for simplicity.
e.g. (1.0×4.7)+(3.7×1.9)=11.7 and (1.0×1.9)+(3.7×0.2)=2.6

self_attention_05.png

We can see that "Let's" is more similar to itself (11.7) than to "go" (2.6)
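Reproducing these numbers in NumPy (the Query and Key vectors are the ones implied by the example above):

    import numpy as np

    q_lets = np.array([1.0, 3.7])   # Query for "Let's"
    k_lets = np.array([4.7, 1.9])   # Key for "Let's"
    k_go   = np.array([1.9, 0.2])   # Key for "go"

    print(q_lets @ k_lets)   # 11.73 -> rounded to 11.7 above
    print(q_lets @ k_go)     # 2.64  -> rounded to 2.6 above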

Compute the influence of each token on the current token

(in this case the "current" token is "Let's")

We use a softmax function (whose outputs sum up to 1.0) to compute how much each token should influence the final Self-attention score for this particular token.
e.g. softmax(11.7,2.6)=[1.0,0.0]
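The same calculation in NumPy:

    import numpy as np

    def softmax(scores):
        exp = np.exp(scores - np.max(scores))   # subtract max for numerical stability
        return exp / exp.sum()

    print(softmax(np.array([11.7, 2.6])))   # [0.9999, 0.0001] ~ [1.0, 0.0]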

self_attention_06.png

Compute the token Value to be influenced

(for the token "Let's" in this diagram, but in fact we do it for all tokens at the same time)

Now that we know how much each token of the sentence can influence the current token value, we need to actually get a Value to influence...

⚠️ We use only one set of weights to compute all Values (in fact, one for the Encoder Self-Attention, one for the Decoder Self-Attention and one for the Encoder-Decoder Attention), but as we apply them to different position-encoded embeddings, the results are different for each token.
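As with the Queries and Keys, this is one matrix multiplication with one shared set of weights. A sketch (W_V is a random stand-in; after training it would produce the Value numbers shown in the figure):

    import numpy as np

    x = np.array([[0.9, 0.4],     # position-encoded "Let's" (made up)
                  [0.5, 0.7]])    # position-encoded "go" (made up)

    W_V = np.random.randn(2, 2)   # the one set of Value weights
    V = x @ W_V                   # one Value vector per token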

self_attention_07.png

Compute the influenced intermediary embeddings by scaling the values

(for all tokens)

Here, we use the influence ratio on the current token Value to get intermediary values...
e.g. [(2.5×1.0),(2.1×1.0)]=[2.5,2.1] and [(2.7×0.0),(1.5×0.0)]=[0.0,0.0]
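The same scaling in NumPy, using the softmax output from the previous step and the Value vectors from the figure:

    import numpy as np

    weights = np.array([1.0, 0.0])   # softmax output from the previous step
    V = np.array([[2.5, 2.1],        # Value for "Let's" (figure numbers)
                  [2.7, 1.5]])       # Value for "go" (figure numbers)

    scaled = weights[:, np.newaxis] * V   # [[2.5, 2.1], [0.0, 0.0]]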

Screenshot from 2024-03-28 16-27-04.png

Compute the Self-attention values

(for the token "Let's")

Finally, by summing the intermediary embeddings for all tokens of the sentence, we compute the Self-attention values for the current token.
e.g. [2.5,2.1]+[0.0,0.0]=[2.5,2.1]
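In NumPy, this is a single sum over the scaled rows (reusing the toy numbers):

    import numpy as np

    scaled = np.array([[2.5, 2.1],    # "Let's" Value scaled by 1.0
                       [0.0, 0.0]])   # "go" Value scaled by 0.0

    attention_lets = scaled.sum(axis=0)   # -> [2.5, 2.1]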

self_attention_08.png

Repeat for the next tokens in the sentence...

Now that we have computed the Self-attention values for the current token, it's time to start over with the next token and the next one etc. until the end of the sentence.

⚠️ But no matter how many words are input to the Transformer, we can reuse the same sets of weights for Queries, Keys and Values inside a given Self-Attention unit (in fact, we need one set of each for the Encoder Self-Attention, one of each for the Decoder Self-Attention, and one of each for the Encoder-Decoder Attention).

So once we have computed the values for the first token, we don't need to recalculate the Keys and Values for the other tokens; we only need the Query for each new current token. Then we redo the math to get its Self-attention values:

  1. Compute the Dot Product similarity between the Query and the Keys
  2. Compute the influence of each token on the current token (using SoftMax)
  3. Compute the influenced intermediary embeddings by scaling the values
  4. Compute the Self-attention values

However, we don't really need to compute them sequentially: Transformers can take advantage of parallel computing, so we can compute the Queries, Keys and Values for all tokens at the same time using vectorization.
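A compact sketch of the whole unit, vectorized over all tokens at once (random stand-ins for the learned weights; unlike the walkthrough above, this includes the √(embedding size) scaling of the scores used in the original paper):

    import numpy as np

    def self_attention(x, W_Q, W_K, W_V):
        Q = x @ W_Q                               # Queries for every token at once
        K = x @ W_K                               # Keys for every token
        V = x @ W_V                               # Values for every token
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product similarities
        exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = exp / exp.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                        # weighted sum of the Values

    # Toy usage:
    d_model = 2
    x = np.random.randn(2, d_model)               # two position-encoded tokens
    W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))
    out = self_attention(x, W_Q, W_K, W_V)        # shape (2, d_model)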