At the Frontier of AI: Reviewing Top Papers on Mixture of Experts in Machine Learning — Part 5

Mixture-of-Experts with Expert Choice Routing

Isaac Kargar
7 min read · Feb 14, 2024

The paper “Mixture-of-Experts with Expert Choice Routing” focuses on improving the efficiency and effectiveness of sparsely-activated Mixture-of-Experts (MoE) models in machine learning.

Previous studies often used a "token choice" routing strategy, where each token selects its best one or two experts. This typically resulted in uneven expert workloads and underutilization. To address the imbalance, sparsely gated networks added auxiliary losses for regularization, but with limited success. Consequently, expert capacity had to be overprovisioned significantly (2x–8x) to prevent tokens from being dropped when buffers overflow. Furthermore, most prior works assigned a fixed number of experts to each token, ignoring the varying importance of different tokens.

This paper proposes a new method where instead of tokens selecting the top experts, the experts select the top tokens. This approach ensures better load balancing and allows a variable number of experts for each token. The method is shown to significantly improve training efficiency and performance on benchmark tasks, while also addressing common pitfalls of conventional MoE models.

In this paper, they introduce a novel routing method for Mixture-of-Experts (MoE) named "expert choice". In contrast with traditional MoE, where tokens choose experts, their approach lets experts select the top-k tokens. This ensures perfect load balancing and a flexible number of experts per token, and it greatly enhances training efficiency and performance, as evidenced by the paper's experiments. Key contributions include: resolving load-imbalance issues in traditional MoE without auxiliary losses, achieving over 2× faster training convergence at larger model scales, demonstrating scalability as the number of experts increases, and outperforming comparable dense models on several downstream tasks.

In Expert Choice (EC) routing, expert capacity is set by multiplying the average number of tokens per expert in a batch by a capacity factor, which determines the average number of experts each token is routed to. The method uses a token-to-expert score matrix to guide routing, indicating how likely each token is to be routed to each expert. Implemented in the dense feedforward layer of Transformer networks, the approach applies a top-k function so that each expert selects its most relevant tokens, followed by a permutation function that organizes the hidden values for the experts. Because tokens are distributed evenly across experts by construction, there is no need to overprovision capacity, which reduces training and inference time by about 20% compared to previous models like GLaM.
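To make this concrete, below is a minimal PyTorch sketch of expert-choice routing as described above. The function name, tensor shapes, and the capacity-factor value are illustrative assumptions rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, w_gate, capacity_factor=2.0):
    """Sketch of expert-choice routing for one group/batch of tokens.

    x: (n, d) token representations
    w_gate: (d, e) learnable gating weights
    Returns per-expert gating weights and the token indices each expert picked.
    """
    n, _ = x.shape
    e = w_gate.shape[1]
    # Expert capacity k: each token is routed to `capacity_factor` experts on average.
    k = int(capacity_factor * n / e)

    # Token-to-expert affinity scores, normalized over the expert dimension.
    scores = F.softmax(x @ w_gate, dim=-1)                 # (n, e)

    # Each expert picks its top-k tokens (top-k along the token axis).
    gating, token_idx = torch.topk(scores.t(), k, dim=-1)  # both (e, k)
    return gating, token_idx

# Example: 8 tokens, model dim 16, 4 experts -> each expert takes k = 4 tokens.
gating, token_idx = expert_choice_route(torch.randn(8, 16), torch.randn(16, 4))
print(gating.shape, token_idx.shape)  # torch.Size([4, 4]) torch.Size([4, 4])
```

The gathered tokens would then be processed by each expert's feedforward network and scattered back to their original positions, weighted by the corresponding gating scores.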

In simple terms, in token-choice routing, each token independently chooses the best expert(s) based on its own scores. For example, when several tokens arrive at the gating mechanism, each token evaluates and selects the most suitable expert(s) for processing, possibly leaving some experts overwhelmed and others underutilized.

In contrast, expert-choice routing flips this approach. Here, each expert chooses the tokens it will process. When tokens arrive at the gating mechanism, instead of the tokens making the choice, each expert selects from the available tokens based on its capacity and specialization. This balances the workload across experts, preventing overload on some experts and ensuring better utilization of all expert resources.
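Mechanically, the only difference between the two schemes is the axis along which the top-k is taken over the token-to-expert score matrix. A toy PyTorch illustration (the sizes and k values below are arbitrary choices for the example):

```python
import torch
import torch.nn.functional as F

scores = F.softmax(torch.randn(8, 4), dim=-1)  # (tokens, experts) affinity matrix

# Token choice: every token picks its top-2 experts (top-k along the expert axis).
# Nothing prevents many tokens from piling onto the same popular expert.
tc_weights, tc_experts = torch.topk(scores, k=2, dim=-1)  # shapes (8, 2)

# Expert choice: every expert picks its top-4 tokens (top-k along the token axis).
# Each expert processes exactly 4 tokens, so the load is balanced by construction.
ec_weights, ec_tokens = torch.topk(scores, k=4, dim=0)    # shapes (4, 4)
```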

From Sparse to Soft Mixtures of Experts

Sparse mixture-of-experts (MoE) architectures are known to expand model capacity without large increases in training or inference cost. However, these architectures face several challenges, including training instability, token dropping, an inability to scale the number of experts, and ineffective fine-tuning. In response, the researchers developed Soft MoE, a fully differentiable sparse Transformer that retains the advantages of traditional MoEs while addressing their shortcomings. Soft MoE performs a soft assignment, passing different weighted combinations of the input tokens to each expert. As in other MoE models, each expert in Soft MoE processes only a subset of these combined tokens, enabling larger model capacity at reduced inference cost.

In visual recognition, Soft MoE significantly outperforms standard Transformers (ViTs) and popular MoE variants such as Tokens Choice and Experts Choice. For example, Soft MoE-Base/16 reduces inference cost by 10.5× (and wall-clock time by 5.7×) compared to ViT-Huge/14, while maintaining comparable performance after a similar training period. Soft MoE also scales impressively: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40× more parameters than ViT Huge/14, yet its inference time increases by only about 2%, and its performance is significantly better.

The main differences between Sparse MoE layers and Soft MoE layers lie in their approach to token assignment and the resulting implications for optimization and implementation.

In Sparse MoE layers, the routing mechanism is designed to assign individual input tokens to each available slot. This discrete assignment process requires the router to learn specific token-to-slot allocations. While effective, this approach introduces several optimization and implementation challenges, such as ensuring a balanced load across experts and managing the computational complexity that arises from handling discrete assignments.

On the other hand, Soft MoE layers adopt a different strategy. In these layers, each slot is formed as a weighted average of all input tokens, rather than by assigning individual tokens to specific slots. This allows a more fluid and flexible distribution of information across the experts. By using weighted averages, Soft MoE layers avoid the optimization and implementation issues associated with learning discrete assignments. This simplifies the routing process, as it sidesteps the need for the router to make hard decisions about which token goes to which slot, thereby reducing the complexity and potential bottlenecks in the model's architecture. The following figure shows the routing algorithm in more detail:

The Soft MoE routing algorithm distributes input tokens to the experts in several steps, explained below (a code sketch that puts the steps together follows the list):

  1. Computing Logits: Initially, Soft MoE calculates scores, also known as logits, for every combination of input token and slot. This computation is based on learnable parameters that are specific to each slot.
  2. Normalization and Combination: The logits for each slot (column-wise) are normalized. Following this normalization, each slot computes a linear combination of all input tokens using these normalized weights, depicted in green in the diagram. This step essentially mixes the information from all input tokens into each slot, based on how relevant each token is to that slot.
  3. Expert Processing: Each expert, which is a multi-layer perceptron (MLP) in this context, then processes its assigned slots. For instance, in the provided diagram, each expert handles 2 slots.
  4. Final Combination: The same logits originally computed are now normalized per token (row-wise normalization). These normalized values are used to combine the outputs of all slots for each input token, as indicated in blue in the diagram.
  5. Learnable Parameters: The process involves learnable parameters (shown as dashed boxes in the diagram), which the algorithm adjusts during training to optimize the routing and processing of input tokens.
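Putting the five steps together, here is a minimal PyTorch sketch of a Soft MoE layer operating on a single sequence. The class name, initialization scale, and the small per-expert MLPs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoE(nn.Module):
    """Minimal Soft MoE layer sketch (single sequence, no batch dimension)."""

    def __init__(self, dim, num_experts, slots_per_expert, hidden):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        num_slots = num_experts * slots_per_expert
        # Step 5: learnable per-slot parameters used to score every (token, slot) pair.
        self.phi = nn.Parameter(torch.randn(dim, num_slots) / dim ** 0.5)
        # One small MLP per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, dim)
        # Step 1: logits for every (token, slot) combination.
        logits = x @ self.phi                   # (n_tokens, n_slots)

        # Step 2: column-wise softmax -> each slot is a weighted average of all tokens.
        dispatch = F.softmax(logits, dim=0)     # (n_tokens, n_slots)
        slots = dispatch.t() @ x                # (n_slots, dim)

        # Step 3: each expert processes its own group of slots.
        slots = slots.view(self.num_experts, self.slots_per_expert, -1)
        outs = torch.stack([exp(s) for exp, s in zip(self.experts, slots)])
        outs = outs.view(-1, x.shape[-1])       # (n_slots, dim)

        # Step 4: row-wise softmax -> each token combines the outputs of all slots.
        combine = F.softmax(logits, dim=1)      # (n_tokens, n_slots)
        return combine @ outs                   # (n_tokens, dim)

# Example: 16 tokens of dimension 32, 4 experts with 2 slots each (8 slots total).
layer = SoftMoE(dim=32, num_experts=4, slots_per_expert=2, hidden=64)
y = layer(torch.randn(16, 32))
print(y.shape)  # torch.Size([16, 32])
```

Note how the same logits tensor is normalized twice: over tokens to build the slots (dispatch weights) and over slots to reassemble each token's output (combine weights).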

Here are some properties of Soft MoEs and connections with Sparse MoEs:

  • Fully Differentiable Algorithm: Soft MoE layers differ from Sparse MoE in how assignments are made. While Sparse MoE relies on discrete, non-differentiable mechanisms such as top-k selection in "Token Choice" and "Expert Choice" routers, Soft MoE layers use continuous, fully differentiable operations: soft assignments via softmax scores instead of hard assignments (see the short check after this list).
  • No Token Dropping or Expert Imbalance: Classical Sparse MoE routers often suffer from token dropping and expert imbalance, which hurt performance. Soft MoE is immune to both problems: each slot is a weighted average of all tokens, so every token is used and work is spread evenly across the experts.
  • Fast and Efficient: The efficiency of Soft MoE is determined by the total number of slots, not the number of experts. This setup, free from sorting or top-k operations, makes Soft MoE faster and better suited for hardware accelerators compared to most sparse MoEs.
  • Combining Features of Sparse and Dense Models: Unlike Sparse MoEs that apply expert parameters to a subset of input tokens, Soft MoEs involve every token in each slot, making it technically non-sparse. However, it’s not a Dense MoE either, as each expert only processes a subset of slots, not all input tokens.
  • Per-Sequence Determinism: Under capacity constraints, Sparse MoE models route groups of tokens that may come from different sequences, and those tokens compete for the limited slots in each expert's buffer, so the result for one sequence can depend on which other sequences share its group. Soft MoE avoids this by setting the group size to a single sequence, ensuring each expert handles (combined) tokens from every input, thereby maintaining per-example determinism while remaining fast.
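As a quick sanity check of the differentiability property, the routing parameters in the SoftMoE sketch above receive gradients through ordinary backpropagation, whereas the integer indices returned by a hard top-k selection carry no gradient back to the router. A small illustration reusing the hypothetical class defined earlier:

```python
# Reusing the SoftMoE sketch from the previous code block.
layer = SoftMoE(dim=32, num_experts=4, slots_per_expert=2, hidden=64)
x = torch.randn(16, 32)

layer(x).sum().backward()
print(layer.phi.grad is not None)  # True: the routing parameters train by plain backprop

# By contrast, the integer indices produced by a discrete top-k (as in Sparse MoE
# routers) carry no gradient; such routers learn only through the gating weights
# applied to expert outputs and often rely on auxiliary load-balancing losses.
```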

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.

Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
