At the Frontier of AI: Reviewing Top Papers on Mixture of Experts in Machine Learning — Part 1

Isaac Kargar
Dec 13, 2023


Note: AI tools were used in writing this blog post!

[Cover image generated by ChatGPT]

The ability of a neural network to learn and store information depends on its size, specifically the number of parameters it has. In deep learning research, there is a growing trend toward finding better ways to scale up this number of parameters. One such method is the Mixture-of-Experts (MoE) approach, a form of conditional computation in which only certain parts of the network are used for each specific example. The advantage of MoE is that it allows a significant expansion of the network's capacity to learn and process information without a comparable increase in computational cost. It is somewhat similar to an ensemble of models, but in an ensemble the models are trained separately and their outputs are combined afterward; there is no joint training and no learnable gating network.

In this series of blog posts, I review some interesting papers related to the topic of Mixture of Experts.


Let’s dive in!

Hierarchical mixtures of experts and the EM algorithm


The paper “Hierarchical Mixtures of Experts and the EM Algorithm” by Michael I. Jordan and Robert A. Jacobs describes the earliest mixture-of-experts architecture I have found. It presents a tree-structured architecture for supervised learning in which both the mixture coefficients and the mixture components are generalized linear models (GLIMs). The learning process is framed as a maximum likelihood problem and solved with the Expectation-Maximization (EM) algorithm, including an online variant in which parameters are updated incrementally. The model's effectiveness is demonstrated through simulations on a robot dynamics task, showing its efficiency and accuracy compared to traditional algorithms. The paper emphasizes the HME model's advantages, including its statistical framework, its versatility in handling various data types, and its potential for Bayesian extensions.

In the Hierarchical Mixtures of Experts (HME) model, the expert networks and gating mechanisms are central components.

  1. Expert Networks: These are generalized linear models (GLIMs) that sit at the leaves of the hierarchical tree. Each expert network specializes in modeling a specific region of the data space; in essence, it acts as a local model that produces predictions for the instances that fall within its domain.
  2. Gating Networks: The gating networks are also GLIMs, but they sit at the internal nodes of the tree and determine how data points are allocated to the experts below them. They act like ‘gates’, deciding, based on the characteristics of the input, how much weight each branch (and ultimately each expert) should receive. The gating networks essentially learn which part of the data space each expert is best suited to model.

Together, these two mechanisms allow the model to adaptively partition the data space and apply the most appropriate local model (expert) to each region, thereby enabling more accurate and flexible modeling compared to traditional, monolithic approaches.
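
To make the structure concrete, here is a minimal sketch of a one-level mixture of experts with linear (GLIM-style) experts and a softmax gating network. The class and variable names are my own, and this version would be trained by gradient descent rather than the paper's EM algorithm, so treat it as an illustration of the architecture rather than the paper's exact method.

```python
import torch
import torch.nn as nn

class SimpleMixtureOfExperts(nn.Module):
    """One-level mixture of experts: linear (GLIM-style) experts plus a
    softmax gating network that produces the mixture coefficients."""

    def __init__(self, in_dim: int, out_dim: int, num_experts: int):
        super().__init__()
        # Each expert is a simple linear model over the input.
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        # The gate is also linear in the input; softmax turns its outputs
        # into mixture coefficients that sum to 1 for each example.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_probs = torch.softmax(self.gate(x), dim=-1)            # (batch, n_experts)
        expert_outs = torch.stack(
            [expert(x) for expert in self.experts], dim=1
        )                                                           # (batch, n_experts, out_dim)
        # Mixture prediction: gate-weighted sum of the expert outputs.
        return (gate_probs.unsqueeze(-1) * expert_outs).sum(dim=1)

# Example: 8 inputs with 4 features, 3 experts, scalar output.
moe = SimpleMixtureOfExperts(in_dim=4, out_dim=1, num_experts=3)
y = moe(torch.randn(8, 4))   # shape (8, 1)
```

The hierarchical version in the paper stacks this structure: each “expert” at an upper level is itself a gated mixture, and EM alternates between estimating which expert is responsible for each data point and refitting the GLIMs.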

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer


Many years after the above paper, Google published “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, which introduces a novel neural network component, the Sparsely-Gated Mixture-of-Experts (MoE) layer, that dramatically increases model capacity without a proportional increase in computation. With this layer, you can build a much larger model with far more capacity while keeping roughly the same computational cost for training and inference.

This is achieved by having up to thousands of feed-forward sub-networks, where a trainable gating network selects a sparse combination for each example. The paper demonstrates the application of this approach to language modeling and machine translation, achieving significant improvements over state-of-the-art models with lower computational costs. The MoE is particularly effective for large datasets, enabling the training of models with up to 137 billion parameters.

The Mixture-of-Experts (MoE) layer is composed of a number of expert networks (E1 to En) and a gating network (G) that produces a sparse vector of weights. The expert networks are feed-forward neural networks with identical architectures but separate parameters. Each input is routed to one or a few experts according to the gating network's decision, and the layer's output for an input x is the sum over experts of the gating value G(x)_i multiplied by the expert output E_i(x). Computational efficiency comes from the sparsity of G(x): whenever a gating value is zero, the corresponding expert is skipped entirely. The paper also discusses hierarchical MoEs and their relationship to other conditional computation models.
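
Below is a rough sketch of that combine step. The function and variable names are mine, real implementations batch the routing far more efficiently, and I assume each expert maps a d-dimensional input back to d dimensions (as in the paper's language-modeling setting), but it shows how y = Σ_i G(x)_i · E_i(x) is computed while skipping experts whose gate value is zero.

```python
import torch
import torch.nn as nn

def moe_forward(x: torch.Tensor,
                experts: nn.ModuleList,
                gate_values: torch.Tensor) -> torch.Tensor:
    """Combine step of a sparsely-gated MoE layer: y = sum_i G(x)_i * E_i(x).

    x           : (batch, dim) inputs
    experts     : expert networks with identical architecture (here: dim -> dim)
    gate_values : (batch, num_experts) sparse output of the gating network G(x)
    """
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        routed = gate_values[:, i] > 0      # which examples were sent to expert i
        if routed.any():                    # experts with no traffic are never evaluated
            out[routed] += gate_values[routed, i].unsqueeze(-1) * expert(x[routed])
    return out
```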

The Gating Network can be of two types:

  • Softmax Gating, which multiplies the input by a trainable weight matrix and applies the softmax function, and
  • Noisy Top-K Gating, which adds sparsity and tunable noise for improved efficiency and load balancing. It perturbs the gate logits with Gaussian noise and retains only the top k values per example, ensuring sparsity (a sketch follows below).

Training is done with ordinary back-propagation; unlike some earlier conditional-computation approaches, gradients flow through the gating network as well as through the experts.
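
Here is a hedged sketch of the Noisy Top-K gating computation as I read it from the paper; the weight-matrix names and the train flag are my own additions.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x: torch.Tensor,
                       w_gate: torch.Tensor,
                       w_noise: torch.Tensor,
                       k: int,
                       train: bool = True) -> torch.Tensor:
    """Noisy Top-K gating: H(x)_i = (x W_gate)_i + N(0, 1) * softplus((x W_noise)_i),
    keep only the top-k logits per example, set the rest to -inf, then softmax.

    w_gate, w_noise : (dim, num_experts) trainable weight matrices
    Returns a (batch, num_experts) gate matrix that is zero outside the top k.
    """
    clean_logits = x @ w_gate
    if train:
        # Input-dependent Gaussian noise helps with load balancing across experts.
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits

    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    # Softmax over -inf entries yields exactly zero, giving a sparse gate vector G(x).
    return torch.softmax(masked, dim=-1)
```

Setting the non-top-k logits to negative infinity before the softmax makes those gate values exactly zero, which is what allows the corresponding experts to be skipped during both training and inference.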

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.

Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
