Inside Transformers: An In-depth Look at the Game-Changing Machine Learning Architecture — Part 1

Isaac Kargar
May 29, 2023


Note: AI tools are used as an assistant in this post!

Generated by Microsoft Bing Image Creator

As the field of artificial intelligence (AI) continues to evolve at a rapid pace, a few architectures stand out for how profoundly they have reshaped it. The Transformer is chief among them: it has transformed not only natural language processing but many other areas of machine learning as well.

In their seminal 2017 paper “Attention Is All You Need”, Vaswani et al. introduced the Transformer and changed the way we model sequences. Its key innovation, the attention mechanism, simplified sequence-to-sequence tasks and made long-range dependencies in data far easier to handle.

But what makes Transformers so powerful? How do they use attention to capture positional information and dependencies between tokens? And why have they become the go-to architecture for so many modern machine learning tasks, even outside of natural language processing?

In this blog post, we’ll take a close look at how the Transformer architecture works on the inside. We’ll walk through its main components, from inputs and embeddings to positional encoding and multi-head attention, all the way to the outputs, and see how its design changed sequence modeling and why it transfers to such a wide range of machine learning tasks.

This guide aims to explain how the Transformer works, whether you are an experienced machine learning engineer, a researcher, or someone new to the field who wants to learn about one of the most influential architectures in AI. Join me as we take a closer look at the technology that is reshaping artificial intelligence.

Before I get into the transformer architecture blocks, I’ll quickly talk about the history of the attention mechanism. Here are other posts from this series:

A Brief History of the Attention Mechanism

The history of attention mechanisms in deep learning can be traced back to the development of recurrent neural networks (RNNs) and has evolved into the powerful architecture of transformers, which have become a dominant force in the field of natural language processing (NLP) and beyond. Here’s an overview of the key milestones in the history of attention mechanisms:

1. Recurrent Neural Networks (RNNs): RNNs were introduced in the late 1980s and early 1990s as a way to model sequential data. These networks are designed to maintain an internal state or “memory” that can capture information from previous time steps. However, RNNs suffer from the vanishing and exploding gradient problems, which make it difficult to capture long-term dependencies in sequences.

2. Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU): To overcome the limitations of RNNs, LSTMs were introduced by Hochreiter and Schmidhuber in 1997, and GRUs were later proposed by Cho et al. in 2014. These architectures use gating mechanisms to selectively update and forget information, enabling them to capture longer-range dependencies more effectively than traditional RNNs.

3. Neural Machine Translation (NMT) and Seq2Seq Models: In 2014, Sutskever et al. introduced the Sequence-to-Sequence (Seq2Seq) model, which uses an encoder-decoder architecture for tasks like machine translation. The encoder processes the input sequence and generates a fixed-size context vector, while the decoder generates the output sequence based on this context vector. However, this fixed-size representation can limit the model’s ability to handle long sequences.

4. Attention Mechanism: The attention mechanism was introduced by Bahdanau et al. in 2015 as a way to address the limitations of fixed-size context vectors in Seq2Seq models. Instead of compressing the entire input sequence into a single context vector, attention allows the model to weigh different parts of the input sequence when generating each token in the output sequence. This improves the performance of neural machine translation and other sequence-to-sequence tasks.

Figure: The encoder-decoder model with additive attention mechanism (Bahdanau et al., 2015).

5. Self-Attention and Transformers: In 2017, Vaswani et al. introduced the Transformer architecture, which relies on self-attention mechanisms to process input sequences. Transformers eliminate the need for recurrent connections and instead use a series of multi-head self-attention layers to process input tokens in parallel. This design allows for more efficient training, better handling of long-range dependencies, and improved scalability.

Since their introduction, Transformers have become the foundation for many state-of-the-art NLP models, such as BERT, GPT, and RoBERTa. Attention mechanisms have also been applied in other domains, such as computer vision and reinforcement learning, demonstrating their versatility and effectiveness in a wide range of applications.

We will go into the details of the self-attention mechanism and Transformer’s architecture in this blog post. To learn more about the works before Transformers, you can read this awesome blog post.

Transformer Architecture

The following slide shows the Transformer architecture:

source: Introduction to Deep Learning — Raschka

Here is a high-level overview of the Transformer model pipeline; minimal code sketches of the main steps follow the list:

1. Input: The model takes a sequence of words as input. The words are tokenized, and each token is represented by a unique integer.

2. Embedding: The integers are then converted into fixed-length vectors through an embedding layer. This vector representation captures the semantic meaning of the word.

3. Positional Encoding: Since Transformer models don’t inherently understand the order of words in a sequence (as they are not recurrent), a positional encoding is added to the word embeddings. This injects information about the position of each word in the sequence. The positional encoding can either be learned or be a fixed function of position, such as the sinusoidal encoding sketched in code after this list.

4. Self-Attention (or Scaled Dot-Product Attention): The heart of the Transformer. When encoding a particular word, it lets the model weigh the relevance of every other word in the sentence, quantifying how much ‘attention’ that word should pay to each of the others (see the sketch after this list).

5. Multi-Head Attention: This mechanism allows the model to focus on different positions, capturing various features from different perspectives. Essentially, it runs the self-attention mechanism in parallel multiple times (heads) with different learned linear transformations of the input, and then concatenates and transforms the results (also sketched after the list).

6. Feed-Forward Neural Networks: These are present in both the encoder and decoder. After multi-head attention, the output is passed through a feed-forward neural network independently for each position.

7. Layer Normalization: This is a technique to stabilize the learning process and accelerate training. Layer normalization is applied after each multi-head attention block and the feed-forward neural network.

8. Residual Connections: These are used around each of the two sub-layers (multi-head attention and FFNN) to help mitigate the vanishing gradient problem and ease optimization. Each sub-layer’s output is added to its input, and this result is then normalized (steps 6-8 are combined in the encoder-layer sketch after this list).

9. Encoder and Decoder Blocks: The Transformer has an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. The decoder generates an output sequence of symbols from these continuous representations. Each of these consists of a stack of identical layers (the number of layers is a hyperparameter). The decoder has an additional multi-head attention layer to attend to the encoder output.

10. Output: The final output of the Transformer is a sequence of vectors, where each vector corresponds to a token in the output sequence. These vectors can then be transformed into a probability distribution over the output vocabulary using a final linear layer followed by a softmax.
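To make steps 1-3 concrete, here is a minimal sketch in PyTorch of tokenization, embedding, and fixed sinusoidal positional encoding. The toy vocabulary and the tiny tokenize helper are illustrative assumptions for this post, not part of the original paper; only the sinusoidal formula and the d_model = 512 dimension follow Vaswani et al.

```python
# A minimal sketch of steps 1-3 (tokenization, embedding, positional encoding).
# The vocabulary and tokenizer below are toy assumptions for illustration only.
import math
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "transformer": 2, "is": 3, "powerful": 4}  # toy vocab
d_model = 512  # embedding dimension used in the original paper

def tokenize(sentence: str) -> torch.Tensor:
    """Step 1: map words to integer ids (batch of one sentence)."""
    return torch.tensor([[vocab[w] for w in sentence.lower().split()]])

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Step 3: fixed sinusoidal encoding, as in Vaswani et al."""
    position = torch.arange(seq_len).unsqueeze(1)                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

token_ids = tokenize("the transformer is powerful")                   # (1, 4)
embedding = nn.Embedding(len(vocab), d_model)                          # step 2
x = embedding(token_ids) * math.sqrt(d_model)                          # scaling used in the paper
x = x + sinusoidal_positional_encoding(token_ids.size(1), d_model)     # inject position information
print(x.shape)  # torch.Size([1, 4, 512])
```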
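Step 4 boils down to one formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Here is a minimal sketch of it, assuming the usual (batch, heads, seq_len, d_k) tensor layout:

```python
# A minimal sketch of scaled dot-product attention (step 4).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k). Returns weighted values and attention weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g. the causal mask in the decoder
    weights = torch.softmax(scores, dim=-1)                    # how much each token attends to the others
    return weights @ v, weights
```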
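Step 5 then runs that attention several times in parallel with different learned projections. The sketch below reuses the scaled_dot_product_attention function from the previous snippet; the class and its attribute names are illustrative, with d_model = 512 and 8 heads as in the original paper.

```python
# A multi-head attention sketch (step 5): project the input h times, attend in
# parallel, then concatenate the heads and project back to d_model.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self.split_heads(self.w_q(query))
        k = self.split_heads(self.w_k(key))
        v = self.split_heads(self.w_v(value))
        out, _ = scaled_dot_product_attention(q, k, v, mask)  # from the previous sketch
        batch, _, seq_len, _ = out.shape
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)  # concatenate heads
        return self.w_o(out)
```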
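Finally, steps 6-8 combine into a single encoder layer: attention, then a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization (the post-norm layout of the original paper). Step 10 is just a linear layer plus softmax over the vocabulary. This sketch reuses MultiHeadAttention from the previous snippet; the vocabulary size is an arbitrary placeholder.

```python
# A sketch of one encoder layer (steps 6-8) plus the final projection to
# vocabulary logits (step 10). Sizes follow the original paper; the code is illustrative.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the previous sketch
        self.ffn = nn.Sequential(                            # position-wise feed-forward (step 6)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)                   # layer normalization (step 7)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.attn(x, x, x, mask))         # residual connection (step 8)
        return self.norm2(x + self.ffn(x))

# Step 10: map the final vectors to a probability distribution over the vocabulary.
vocab_size = 32000                                           # placeholder vocabulary size
to_logits = nn.Linear(512, vocab_size)
# probs = torch.softmax(to_logits(decoder_output), dim=-1)
```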

We will dig deeper into some of these components in the next blog posts.

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.




Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
