Transformer²: Self-Adaptive LLMs

It’s getting exciting 👌
Sakana AI just released Transformer² (“Transformer-squared”), a framework that allows large language models (LLMs) to adapt dynamically to various tasks in real time. Unlike traditional static fine-tuning methods, Transformer² uses a two-step process to adjust its behavior based on task requirements:
1. 𝑇𝑎𝑠𝑘 𝐴𝑛𝑎𝑙𝑦𝑠𝑖𝑠: The model first identifies the type of task (e.g., math, coding, reasoning) using a dispatch system.
2. 𝐷𝑦𝑛𝑎𝑚𝑖𝑐 𝐴𝑑𝑎𝑝𝑡𝑎𝑡𝑖𝑜𝑛: It then mixes task-specific “expert” vectors z (one learned weight vector per expert), which are pre-trained using reinforcement learning, to adjust the model’s behavior for optimal performance.
𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐫𝐨𝐜𝐞𝐬𝐬
𝑆𝑖𝑛𝑔𝑢𝑙𝑎𝑟 𝑉𝑎𝑙𝑢𝑒 𝐹𝑖𝑛𝑒-𝑇𝑢𝑛𝑖𝑛𝑔 (𝑆𝑉𝐹):
- SVF decomposes each weight matrix in the LLM into three components via singular value decomposition (SVD):
U: left singular vectors
Σ: singular values (diagonal matrix)
V: right singular vectors
The training modifies only the singular values (Σ) using learnable vectors z, allowing for precise, targeted adjustments to the model’s weights.
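The idea above can be sketched in a few lines of NumPy. This is a minimal illustration for a single weight matrix, not the paper's implementation; the matrix shape and the values of z are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))          # stand-in for one weight matrix of the LLM

# SVD: W = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(W, full_matrices=False)

# SVF trains only z, one learnable scale per singular value.
z = np.ones_like(sigma)              # identity at init -> W is unchanged
assert np.allclose(U @ np.diag(sigma * z) @ Vt, W)

# A trained z rescales the singular values; e.g. amplifying them all by 10%
# is equivalent to scaling the whole matrix by 1.1:
z = np.full_like(sigma, 1.1)
W_adapted = U @ np.diag(sigma * z) @ Vt
assert np.allclose(W_adapted, 1.1 * W)
```

Because z has only as many entries as W has singular values, this is far more parameter-efficient than fine-tuning W directly.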
𝑅𝑒𝑖𝑛𝑓𝑜𝑟𝑐𝑒𝑚𝑒𝑛𝑡 𝐿𝑒𝑎𝑟𝑛𝑖𝑛𝑔 (𝑅𝐿):
- The expert vectors (z) are optimized via RL using the REINFORCE algorithm.
- A reward function evaluates task-specific outputs and adjusts the vectors to maximize performance.
- Regularization is applied via a KL divergence penalty to maintain consistency with the base model and prevent overfitting.
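The training loop can be sketched with a toy stand-in for the reward. Everything numeric here is invented for illustration: the real reward comes from evaluating task outputs, and the KL penalty is approximated by a simple pull toward the base model (z = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reward that peaks at a hypothetical task optimum z_star.
z_star = np.array([1.2, 0.8, 1.0])

def reward(z):
    return -np.sum((z - z_star) ** 2)

z = np.ones(3)                       # expert vector, identity scaling at init
lr, noise, kl_coeff = 0.05, 0.1, 0.01

for _ in range(300):
    # REINFORCE-style gradient estimate from a batch of perturbations
    # (antithetic pairs z +/- eps to reduce variance).
    eps = rng.normal(size=(16, 3)) * noise
    rew_diff = np.array([reward(z + e) - reward(z - e) for e in eps])
    grad = (rew_diff[:, None] * eps).mean(axis=0) / (2 * noise**2)
    # KL-style regularizer: keep z close to the base model (z = 1).
    grad -= kl_coeff * 2.0 * (z - 1.0)
    z += lr * grad
```

After training, z has moved from the identity toward the task optimum while the penalty keeps it from drifting too far from the base model.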
𝐈𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐏𝐫𝐨𝐜𝐞𝐬𝐬
𝑇𝑤𝑜-𝑃𝑎𝑠𝑠 𝑀𝑒𝑐ℎ𝑎𝑛𝑖𝑠𝑚:
- First Pass: The model observes the input prompt and identifies the task’s requirements. This involves either prompt engineering, a trained classifier, or a mixture-based approach to select or combine relevant expert vectors.
- Second Pass: Based on the identified task, the system dynamically adjusts the model’s weights using the selected expert vectors and generates the final response.
𝐴𝑑𝑎𝑝𝑡𝑎𝑡𝑖𝑜𝑛 𝑆𝑡𝑟𝑎𝑡𝑒𝑔𝑖𝑒𝑠:
- Prompt-Based: Constructs a specific prompt to classify tasks and select pre-trained expert vectors.
- Classifier-Based: Uses a trained task classifier to identify the most relevant expert vector.
- Mixture-Based: Combines multiple expert vectors dynamically for more complex tasks.
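The mixture-based strategy amounts to a weighted combination of the pre-trained expert vectors. The expert names, values, and mixture weights below are hypothetical; in the real system the weights come from the first inference pass:

```python
import numpy as np

# Hypothetical pre-trained expert vectors, one per task domain,
# each scaling the same set of singular values.
z_experts = {
    "math":      np.array([1.3, 0.9, 1.0, 1.1]),
    "code":      np.array([0.8, 1.2, 1.0, 0.9]),
    "reasoning": np.array([1.0, 1.0, 1.2, 1.0]),
}

# Mixture weights from the first pass (a convex combination, summing to 1).
alpha = {"math": 0.6, "code": 0.1, "reasoning": 0.3}

# Combined expert vector used to adapt the weights in the second pass.
z_mixed = sum(alpha[k] * z_experts[k] for k in z_experts)
```

For a prompt the first pass judges to be mostly math with some reasoning, the combined vector leans toward the math expert while blending in the others.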
Read more in the following links if you are interested!
Blog:
Paper:
https://arxiv.org/abs/2501.06252
Code: