Inside Transformers: An In-depth Look at the Game-Changing Machine Learning Architecture — Part 5

Isaac Kargar
4 min read · Feb 14, 2024

Note: AI tools are used as an assistant in this post!

Let’s continue with the components.

Add & Norm

After the multi-head attention block in the encoder, and at several other points in the Transformer architecture, there is a block called Add & Norm. The Add part is a residual connection, similar to the skip connections in ResNet: the input of a sublayer is added to its output, i.e., x + Layer(x).
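As a rough PyTorch sketch (the wrapper class and names here are illustrative, not the exact implementation from the paper), the Add step can be written as a thin module around any sublayer:

```python
import torch
import torch.nn as nn

class ResidualAdd(nn.Module):
    """Minimal sketch of the 'Add' (residual) connection: x + Layer(x)."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection lets gradients flow directly through the
        # addition, which eases optimization of deep stacks of layers.
        return x + self.sublayer(x)

# Example usage with an arbitrary sublayer (hypothetical shapes):
block = ResidualAdd(nn.Linear(8, 8))
y = block(torch.randn(2, 5, 8))  # output has the same shape as the input
```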

Normalization is a technique used in deep learning models to stabilize the learning process and reduce the number of training epochs needed to train deep networks. In the Transformer architecture, a specific type of normalization, called Layer Normalization, is used.

Layer Normalization (LN) is applied over the last dimension (the feature dimension), in contrast to Batch Normalization, which is applied over the first dimension (the batch dimension). In other words, for each position (each token in the sequence), the mean and variance are computed across its feature vector (the features, or ‘channels’, or the concatenated head outputs in multi-head attention), and that position is normalized with its own statistics, independently of the other examples in the batch.
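To make the normalization axis concrete, here is a small PyTorch sketch (the tensor shapes are made up for illustration) that compares nn.LayerNorm with the equivalent manual computation over the feature dimension:

```python
import torch
import torch.nn as nn

# Toy tensor with shape (batch, sequence length, feature dimension)
x = torch.randn(2, 5, 8)

# Built-in LayerNorm normalizes over the last (feature) dimension.
ln = nn.LayerNorm(normalized_shape=8)
out = ln(x)

# Equivalent manual computation: statistics per position, over features.
mean = x.mean(dim=-1, keepdim=True)                 # shape (2, 5, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # shape (2, 5, 1)
manual = (x - mean) / torch.sqrt(var + ln.eps)
manual = manual * ln.weight + ln.bias               # learnable scale and shift

print(torch.allclose(out, manual, atol=1e-6))       # True
```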

In Batch Normalization, by contrast, the mean and variance are calculated across the batch dimension, so each feature is normalized to have the same statistics (zero mean, unit variance) across all examples in the batch.
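The contrast between the two axes can be seen directly in code. This small example (shapes chosen purely for illustration) prints which dimension ends up with zero mean under each scheme:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)  # (batch, features)

# BatchNorm1d: mean/variance per feature, computed across the batch.
bn = nn.BatchNorm1d(num_features=8)
bn_out = bn(x)
print(bn_out.mean(dim=0))  # ~0 for every feature (column-wise)

# LayerNorm: mean/variance per example, computed across the features.
ln = nn.LayerNorm(normalized_shape=8)
ln_out = ln(x)
print(ln_out.mean(dim=1))  # ~0 for every example (row-wise)
```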

