Inside Transformers: An In-depth Look at the Game-Changing Machine Learning Architecture — Part 3

Isaac Kargar
7 min read · Feb 14, 2024


Note: AI tools were used as an assistant in writing this post!

Let’s continue with the components.

Self-Attention Mechanism

The self-attention mechanism is the core component of the transformer architecture. Here is the main procedure in the self-attention mechanism:

1- Derive attention weights: compute the similarity between the current input and all other inputs. That is, for an input sequence x1, x2, …, xT, the self-attention mechanism finds the interaction/similarity between xi and xj.

2- Normalize weights via softmax: take the softmax over the similarities to obtain a probability distribution over all the inputs. The attention weights then sum to 1:

sum(aij, j=1,...,T) = 1

3- Compute the attention value from normalized weights and corresponding inputs: To compute the attention value for the i-th input, we can use the following formula:

Ai = sum(aij * xj, j=1,...,T)

It is basically a weighted average over all inputs, and the output is again a vector that is context-aware.

One easy and basic way to compute the attention weights, without any learning, is to simply use the dot product between the input vectors:

eij = xi^T . xj
aij = softmax([eij], j=1,...,T) = exp(eij) / sum(exp(eij), j=1,...,T)

The following image shows the procedure clearly. For x^(i), the embedding of the i-th token, we calculate its dot product with all token embeddings from 0 to T to obtain the attention weights (aij in our formula, wij in the image). Then we calculate the output as a weighted average of all token embeddings. This new output/embedding of the i-th token (Ai in our formula, o^(i) in the image) is a context-aware embedding that captures the interaction between the i-th input and all other inputs.

source: Python Machine Learning — Third Edition — Raschka & Mirjalili
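To make the procedure concrete, here is a minimal NumPy sketch of this un-parameterized self-attention. The dimensions and variable names (T, d_e, x, e, a, A) are purely illustrative:

import numpy as np

T, d_e = 5, 4                       # sequence length, embedding size
x = np.random.randn(T, d_e)         # token embeddings x_1, ..., x_T

# 1) similarities e_ij = x_i^T x_j for all pairs
e = x @ x.T                         # shape (T, T)

# 2) normalize each row with softmax so that sum_j a_ij = 1
a = np.exp(e - e.max(axis=1, keepdims=True))
a = a / a.sum(axis=1, keepdims=True)

# 3) context-aware outputs A_i = sum_j a_ij * x_j
A = a @ x                           # shape (T, d_e)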

We can also learn the attention weights. But how? By parameterizing the self-attention mechanism!

The idea is to add three trainable matrices, which will be multiplied by the input embeddings to calculate the following three values:

query = Wq * xi
key = Wk * xi
value = Wv * xi

Each input embedding is multiplied by these three matrices, as above, to produce its query, key, and value vectors.

source: Introduction to Deep Learning — Raschka
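As a small sketch with made-up dimensions (de = 4, dk = 3, dv = 6 here are arbitrary), projecting a single embedding xi into its query, key, and value could look like this:

import numpy as np

d_e, d_k, d_v = 4, 3, 6            # embedding, query/key, and value sizes
Wq = np.random.randn(d_k, d_e)     # these matrices are trainable in a real model
Wk = np.random.randn(d_k, d_e)
Wv = np.random.randn(d_v, d_e)

xi = np.random.randn(d_e)          # one input embedding
query, key, value = Wq @ xi, Wk @ xi, Wv @ xi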

Let’s suppose we’re trying to determine the interaction of the second input with all the other inputs, denoted as A2. This requires us to compute the dot product of the second input’s query vector, represented by q2, with the key vectors of all inputs, from k1 to kT. This will yield the attention weights, ranging from a21 to a2T. Afterward, we’ll apply the softmax function to normalize these attention weights. Finally, we can calculate the weighted sum of all the value vectors, extending from v1 to vT. The following image shows all dimensions and also the attention embedding vector for the second input:

source: Introduction to Deep Learning — Raschka

Basically, we start with a token embedding of size de and compute an output vector of dimension dv.
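A minimal sketch of this A2 computation, again with illustrative dimensions and names (the sqrt(dk) scaling is discussed below), could look as follows:

import numpy as np

T, d_e, d_k, d_v = 5, 4, 3, 6
x = np.random.randn(T, d_e)        # x_1, ..., x_T
Wq = np.random.randn(d_k, d_e)
Wk = np.random.randn(d_k, d_e)
Wv = np.random.randn(d_v, d_e)

q2 = Wq @ x[1]                     # query of the 2nd input, shape (d_k,)
keys = x @ Wk.T                    # k_1, ..., k_T, shape (T, d_k)
values = x @ Wv.T                  # v_1, ..., v_T, shape (T, d_v)

scores = keys @ q2                 # unnormalized weights a_21, ..., a_2T
weights = np.exp(scores) / np.exp(scores).sum()
A2 = weights @ values              # context-aware vector for the 2nd input, shape (d_v,)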

The following image shows more examples and how to create the attention score matrix:

source: Introduction to Deep Learning — Raschka

You can also see the following image to get a better understanding of the dimensions and also how to calculate A in a matrix form and not only for one input:

source: Introduction to Deep Learning — Raschka
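The same computation for all inputs at once can be sketched in matrix form, including the sqrt(dk) scaling that appears in the formula (dimensions and names are again illustrative):

import numpy as np

T, d_e, d_k, d_v = 5, 4, 3, 6
X = np.random.randn(T, d_e)              # all token embeddings
Wq = np.random.randn(d_k, d_e)
Wk = np.random.randn(d_k, d_e)
Wv = np.random.randn(d_v, d_e)

Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T   # shapes (T, d_k), (T, d_k), (T, d_v)

scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products, shape (T, T)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
A = weights @ V                          # one context-aware row per input, shape (T, d_v)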

But, as you can see, there is a scaling factor in the formula: sqrt(dk), sometimes called the temperature in other papers. Why do we need this term?

source: Introduction to Deep Learning — Raschka

I asked ChatGPT what it thinks. Here is the answer:

The scaling part of the Scaled Dot Product Attention formula helps mitigate a potential issue that arises when computing the dot products of high-dimensional vectors. When the dimensionality of the Key vectors (dk) is large, the dot products can become very large as well. This is because the dot product is a sum of element-wise products, and as the number of elements in the vectors increases, the sum can grow.

When the dot products become large, the input to the softmax function can have a wide range of values. With such large inputs, the softmax saturates: the largest value dominates and the remaining outputs are pushed toward zero, and in extreme cases this can lead to numerical instability. Such a highly peaked distribution can also result in vanishing gradients during backpropagation and slow down the training process.

To address this issue, the dot product values are scaled down by dividing them by the square root of the dimension of the Key vectors (sqrt(dk)). This normalization step helps maintain a more manageable range of values for the input to the softmax function. The scaling factor, sqrt(dk), is chosen because it is the standard deviation of the dot product of two random vectors with zero mean and unit variance.

The choice of the scaling factor sqrt(dk) is based on the assumption that the elements of the Query and Key vectors are independent random variables with zero mean and unit variance. Under this assumption, the dot product of two such vectors has an expected value of 0 and a variance of dk.

Let’s dive into this concept a bit more. Consider two random vectors u and v, each with n elements, where the elements are independent and identically distributed random variables with mean 0 and variance 1. The dot product of u and v is:

u · v = Σ(u_i * v_i), where i ranges from 1 to n.

Since u_i and v_i are independent, the expected value of their product is:

E[u_i * v_i] = E[u_i] * E[v_i] = 0 * 0 = 0.

Thus, the expected value of the dot product is:

E[u · v] = ΣE[u_i * v_i] = Σ0 = 0.

The variance of the dot product is given by:

Var(u · v) = ΣVar(u_i * v_i), as the elements are independent.

Since u_i and v_i are independent with zero mean and unit variance, Var(u_i * v_i) = Var(u_i) * Var(v_i) = 1 * 1 = 1 (this product rule holds because both means are zero).

Therefore, the variance of the dot product is:

Var(u · v) = ΣVar(u_i * v_i) = Σ1 = n.

So, for vectors with dimensionality n, the dot product has an expected value of 0 and variance of n. In the case of the Scaled Dot Product Attention, we have n = dk.

Dividing the dot product by sqrt(dk) essentially normalizes the dot product values and scales down the variance to 1, which helps maintain a manageable range of values for the input to the softmax function. This normalization reduces the chance of having large values that could lead to vanishing gradients and numerical instability during training.

If you used a scaling factor of dk instead of sqrt(dk), you would be over-correcting for the variance of the dot product, resulting in a variance smaller than 1. This could cause the softmax function to produce a more uniform distribution, which might negatively impact the model’s ability to capture the importance of the most relevant Key-Value pairs in the attention mechanism. The choice of sqrt(dk) balances the need to control the range of dot product values without overly suppressing their variation.

The sqrt(dk) is the standard deviation (std) of the dot product of two random vectors with zero mean and unit variance when the dimension of the vectors is dk.

Recall that the variance of the dot product of two such vectors was derived as Var(u · v) = dk. To obtain the standard deviation, we take the square root of the variance:

std(u · v) = sqrt(Var(u · v)) = sqrt(dk).
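You can check these claims with a quick, purely illustrative NumPy experiment: the raw dot products of such random vectors have a variance of roughly dk, dividing by sqrt(dk) brings it back to roughly 1, and dividing by dk over-corrects:

import numpy as np

d_k, n_samples = 64, 100_000
u = np.random.randn(n_samples, d_k)      # zero mean, unit variance
v = np.random.randn(n_samples, d_k)

dots = (u * v).sum(axis=1)               # dot products u · v
print(dots.var())                        # ~ 64, i.e. ~ d_k
print((dots / np.sqrt(d_k)).var())       # ~ 1
print((dots / d_k).var())                # ~ 1/64, over-corrected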

After some further interactions about the above reasoning, it also told me the following:

The choice of the scaling factor sqrt(dk) is not explicitly derived from the standard deviation of the dot product of two random vectors in the original paper “Attention is All You Need” by Vaswani et al. (2017).

However, the choice of sqrt(dk) can be intuitively justified by the discussion of the dot product of high-dimensional vectors with zero mean and unit variance. The idea is that dividing by sqrt(dk) helps to normalize the dot product values, preventing them from becoming too large and maintaining a manageable range of values for the input to the softmax function.

We will continue in the next post.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.

Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
