Inside Transformers: An In-depth Look at the Game-Changing Machine Learning Architecture — Part 5
Note: AI tools were used as an assistant in writing this post!
Let’s continue with the components.
Add & Norm
After the multi-head attention block in the encoder, and in several other places in the Transformer architecture, there is a block called Add & Norm. The Add part is a residual (skip) connection, similar to those in ResNet: the input of a sub-layer is added to its output, i.e., x + layer(x).
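As a rough illustration (using PyTorch here; the module and variable names are my own, not from any particular implementation), the Add part is just an element-wise sum of a sub-layer's input and its output:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps any sub-layer with a skip connection: output = x + layer(x)."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layer(x)  # the "Add" of Add & Norm
```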
Normalization is a technique used in deep learning models to stabilize the learning process and reduce the number of training epochs needed to train deep networks. In the Transformer architecture, a specific type of normalization, called Layer Normalization, is used.
Layer Normalization (LN) is applied over the last dimension (the feature dimension), in contrast to Batch Normalization, which is applied over the first dimension (the batch dimension). In other words, for each position (token) in each example, the mean and variance are computed across its feature vector (the d_model 'channels'), and this normalization is done for each position separately.
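Here is a small sketch of what that means in practice, assuming PyTorch tensors of shape (batch, sequence_length, d_model); the shapes and names are illustrative:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 8
x = torch.randn(batch, seq_len, d_model)

# Built-in layer norm: statistics are computed over the last dimension.
ln = nn.LayerNorm(d_model)
y = ln(x)

# Equivalent manual computation: one mean/variance per (example, position),
# taken across the d_model features.
mean = x.mean(dim=-1, keepdim=True)                 # shape: (batch, seq_len, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance, as LayerNorm uses
y_manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(y, y_manual, atol=1e-6))  # expect: True
```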
In Batch Normalization, the mean and variance used for normalization are computed across the batch dimension, so each feature is normalized to have the same distribution across the examples in a batch.
In Layer Normalization, you normalize across the feature dimension, and the normalization does not depend on the other examples in the batch. It is computed independently for each example, which makes it better suited to tasks where the batch size can vary (such as sequence-to-sequence tasks like translation and summarization).
Unlike Batch Normalization, Layer Normalization performs exactly the same computation at training and test time. It does not depend on the batch of examples, and with its learnable scale and shift parameters it does not restrict the representational power of the network.
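To make the batch-independence concrete, here is a minimal comparison (again a PyTorch sketch, not production code): batch norm statistics are taken across examples, while layer norm statistics are taken within each example.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)  # 4 examples, 16 features

# BatchNorm: one mean/variance per feature, computed across the 4 examples.
bn = nn.BatchNorm1d(16)
bn.train()
_ = bn(x)  # the statistics depend on the whole batch

# LayerNorm: one mean/variance per example, computed across its 16 features.
ln = nn.LayerNorm(16)
single = ln(x[:1])      # normalizing one example alone...
together = ln(x)[:1]    # ...gives the same result as normalizing it inside a batch
print(torch.allclose(single, together))  # expect: True
```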
In the Transformer model, Layer Normalization is applied in the following areas:
1. After each sub-layer (self-attention or feed-forward): Each sub-layer (either a multi-head self-attention mechanism or a position-wise fully connected feed-forward network) in the Transformer is followed by a Layer Normalization step, combined with a residual connection (a code sketch follows this list).
2. Before the final output layer: The output of the stack of decoder layers is also normalized before it is fed into the final linear layer and softmax for prediction.
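Putting the two ideas together, a post-norm Add & Norm wrapper in the spirit of the original "Attention Is All You Need" arrangement, LayerNorm(x + sublayer(x)), can be sketched as follows; the class and argument names are illustrative, not from any particular library:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # Add the sub-layer's (dropped-out) output to its input, then normalize.
        return self.norm(x + self.dropout(sublayer(x)))

# Example: wrapping a position-wise feed-forward sub-layer.
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
add_norm = AddAndNorm(d_model)
x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out = add_norm(x, ffn)            # same shape as x
print(out.shape)                  # torch.Size([2, 10, 512])
```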
Applying Layer Normalization in these places helps to stabilize the learning process and allows the model to be trained more effectively, contributing to higher performance and faster training.
Conclusion
So, there you have it. We’ve traveled through the heart of the Transformer model and gotten a look at how this game-changing architecture works. From seeing how it handles dependencies and positional encoding to understanding the importance of attention mechanisms, it’s easy to see why Transformers are making such a big splash in machine learning.
But hey, don’t forget that this is just the start! As we’ve seen, Transformer models aren’t just used for language tasks. They show up everywhere in machine learning, changing how we understand and process sequences. Who knows where we’ll see Transformers next? New research and ideas are emerging all the time.
No matter how long you’ve been in the field or how new you are, I hope this deep look has helped you understand the Transformer model and made you want to learn more. It’s pretty cool, right?
Don’t stop here, though. Continue to learn, look around, and ask questions. We keep pushing the limits in AI because that’s what we do.
For now, that’s all. Until next time, happy learning, and cheers to the great world of AI that keeps us all on our toes!
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.