At the Frontier of AI: Reviewing Top Papers on Mixture of Experts in Machine Learning — Part 2

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Isaac Kargar
3 min read · Feb 14, 2024

After the paper we discussed previously, it is now time to bring MoEs into the Transformer architecture. The paper “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding” addresses the challenges of scaling machine learning models and proposes solutions for them. Scaling neural networks has delivered dramatic improvements in model quality, but it also brings practical challenges: the lack of efficient model-parallelism support, super-linear growth of computation cost with model size, difficulties in scaling the infrastructure, and the complexity of implementing partitioning strategies.

To overcome these challenges, the paper proposes a model with 600 billion parameters built on Sparsely-Gated Mixture-of-Experts layers. This approach achieves sub-linear computation cost and constant compilation time; the model was trained on 2048 TPU v3 devices in 4 days. The key design principles include:
1. Sub-linear Scaling: Designing model architecture to keep computation and communication requirements sublinear in model capacity.
2. The Power of Abstraction: Separating model description from partitioning implementation and optimization.
3. Scalable Compilers: Ensuring the system infrastructure, including computation representation and compilation, scales with thousands of devices for parallel execution.

[Figure: Transformer encoder scaled with MoE layers — the MoE layers are sharded across devices while the other layers are replicated (source: GShard paper)]

In the proposed sparse scaling of the Transformer architecture, every other feed-forward layer in both the encoder and decoder is replaced with a position-wise Mixture-of-Experts (MoE) layer. This layer consists of multiple feed-forward networks, each acting as an ‘expert.’ The design allows sub-linear scaling of computation cost, because each token activates only a small subset of the network — at most two experts — independent of the total number of experts. As the number of experts grows, the layer no longer fits on a single device and must be spread across many. That is why the MoE layers are distributed across devices for parallel processing, while other layers such as self-attention are replicated to maintain coherence and efficiency. See the figure above for more details.
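To make the routing concrete, here is a minimal JAX sketch of a position-wise MoE layer with top-2 gating. It is not GShard’s actual implementation (which lives in Lingvo/TensorFlow and dispatches tokens to experts rather than masking them); the function name `moe_ffn`, the shape conventions, and the dense “all experts compute, then mask” formulation are simplifications for illustration.

```python
import jax
import jax.numpy as jnp

def moe_ffn(tokens, gate_w, expert_w_in, expert_w_out):
    """Position-wise MoE feed-forward layer with top-2 gating (illustrative).

    tokens:       [num_tokens, d_model]
    gate_w:       [d_model, num_experts]
    expert_w_in:  [num_experts, d_model, d_ff]
    expert_w_out: [num_experts, d_ff, d_model]
    """
    # Gating: softmax over experts, keep only the top-2 per token.
    logits = tokens @ gate_w                                   # [tokens, experts]
    probs = jax.nn.softmax(logits, axis=-1)
    top2_probs, top2_idx = jax.lax.top_k(probs, k=2)           # [tokens, 2]
    # Renormalize the two selected gate weights so they sum to 1.
    top2_probs = top2_probs / top2_probs.sum(axis=-1, keepdims=True)

    # Dense reference computation: every expert processes every token; the
    # top-2 mask below zeroes out the rest. A real implementation dispatches
    # each token to only its two chosen experts.
    hidden = jnp.einsum('td,edf->etf', tokens, expert_w_in)
    hidden = jax.nn.relu(hidden)
    expert_out = jnp.einsum('etf,efd->etd', hidden, expert_w_out)  # [experts, tokens, d_model]

    # Combine: weight each token's two chosen experts by their gate values.
    num_experts = gate_w.shape[-1]
    mask = jax.nn.one_hot(top2_idx, num_experts)               # [tokens, 2, experts]
    combine = (mask * top2_probs[..., None]).sum(axis=1)       # [tokens, experts]
    return jnp.einsum('te,etd->td', combine, expert_out)
```

Because only two experts contribute to each token’s output, the per-token compute of a properly dispatched implementation stays roughly constant as the number of experts grows, which is the sub-linear scaling the paper relies on.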

A gating mechanism decides which expert(s) process each input token. The MoE layer aims to achieve two primary goals:

  1. Balanced Load: It’s essential for the MoE layer to activate experts sparsely for each token. This design ensures a more even distribution of the processing load across all experts, avoiding overburdening a few while underutilizing others.
  2. Efficiency at Scale: The gating function must be implemented efficiently, especially in parallel, to handle large-scale computation without leaving computational resources idle.

The gating function, GATE(.), is critical in this setup, deciding the weight of each expert in processing incoming tokens. It uses a softmax activation function. The paper also introduces mechanisms like expert capacity enforcement, local group dispatching, auxiliary loss, and random routing to improve the efficiency and balance of the MoE layer.
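The sketch below illustrates two of those mechanisms — the auxiliary load-balancing loss and expert-capacity enforcement — again in simplified JAX. The exact loss scaling and the drop-overflow policy are my assumptions in the spirit of the paper, not its precise equations.

```python
import jax
import jax.numpy as jnp

def load_balancing_aux_loss(probs, num_experts):
    """Auxiliary loss in the spirit of GShard's: it is minimized when both the
    hard top-1 assignments and the soft gate probabilities are spread evenly
    across experts (a perfectly uniform gate gives a value of 1 here).

    probs: [num_tokens, num_experts] softmax gate probabilities.
    """
    top1 = jnp.argmax(probs, axis=-1)                                   # hard assignment
    frac_tokens = jnp.mean(jax.nn.one_hot(top1, num_experts), axis=0)   # share of tokens per expert
    mean_prob = jnp.mean(probs, axis=0)                                 # differentiable soft share
    return num_experts * jnp.sum(frac_tokens * mean_prob)

def enforce_capacity(expert_idx, num_experts, capacity):
    """Mark which tokens fit within their expert's capacity.

    expert_idx: [num_tokens] expert chosen for each token.
    Returns a boolean mask; overflowed tokens are dropped from the MoE layer
    (in practice they pass through via the residual connection).
    """
    one_hot = jax.nn.one_hot(expert_idx, num_experts)        # [tokens, experts]
    # 1-based position of each token inside its expert's queue.
    position_in_expert = jnp.cumsum(one_hot, axis=0) * one_hot
    kept = (position_in_expert <= capacity) & (one_hot > 0)
    return kept.any(axis=-1)
```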

This design allows for sub-linear scaling of computation cost, which is crucial for handling the enormous size of state-of-the-art models. The MoE layer is part of a broader effort to scale neural machine translation models to unprecedented sizes while maintaining practical training times and computational efficiency.

Please check the paper for more details on the sharding algorithms.
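GShard itself annotates tensors in TensorFlow/Lingvo and lets the XLA SPMD partitioner generate the per-device program. As a flavor of that annotate-and-compile idea, here is a small sketch using JAX’s sharding API, which descends from the same XLA machinery; it is an analogy, not the paper’s code, and the mesh and axis names are my own.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh whose single axis plays the role of the expert dimension.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=('expert',))

num_experts, d_model, d_ff = len(devices), 512, 2048
expert_w_in = jnp.zeros((num_experts, d_model, d_ff))

# Annotate the expert weights as split along the expert axis; replicated
# tensors (e.g. attention weights) would simply use P() instead.
expert_w_in = jax.device_put(expert_w_in, NamedSharding(mesh, P('expert', None, None)))

@jax.jit
def scale(w):
    # Any jitted computation on the annotated array runs as a single SPMD
    # program; the compiler inserts the required communication automatically.
    return w * 2.0

out = scale(expert_w_in)
```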

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.


Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
