Expert Gate: Lifelong Learning with a Network of Experts
Note: AI tools were used in writing this blog post!
The paper “Expert Gate: Lifelong Learning with a Network of Experts” addresses the problem of lifelong learning in deep learning models. The authors introduce a model that sequentially adds new tasks, or experts, building on previously learned knowledge without needing to retain data from earlier tasks. A critical challenge they address is selecting the appropriate expert for a given test sample. Their solution is a set of gating autoencoders that learn to represent each task and automatically forward test samples to the relevant expert. Each autoencoder tries to reconstruct an incoming sample; the one with the lowest reconstruction error is the one that “knows” the sample best, so its expert model is selected. This approach is memory efficient and also captures task relatedness, which identifies the most relevant prior model (the one whose autoencoder has the lowest reconstruction error on the new task’s data) to serve as the pre-trained model for training or fine-tuning on the new task.
In the world of deep learning and big data, people often train complex models on large amounts of data. Each model is specialized for a specific task and trained on its own dataset. When learning a new task, such as scene classification, the model needs to be adapted (for example, replacing the classification head to match the new number of classes) and fine-tuned on the new data. This process can lead to a problem called “catastrophic forgetting,” where the model performs well on the new task but poorly on previous ones.
Ideally, a system should work well across various tasks and domains. One approach is to train a single model jointly on all tasks, provided all the data is available; when a new task arrives with its data, new layers or neurons are added and the network is retrained on all tasks. However, this method has several drawbacks:
- First, tasks that are unrelated or even conflicting can negatively impact the model.
- Second, the model might not capture the specific details needed for certain tasks, since joint training encourages a hidden representation that is beneficial for all tasks on average.
- Third, every time a new task is added, the entire model must be retrained.
- Most importantly, this approach requires keeping all the training data from previous tasks, which is impractical given the vast amount of data available today.
Instead of trying to build one model that does everything but is an expert at none of it, the authors suggest having separate specialist models for different tasks. They propose a “Network of Experts,” where a new expert model is added for each new task, transferring knowledge from previous models.
As the number of specialized tasks grows, so does the number of expert models. The challenge is that modern GPUs have limited memory and can only load a few models at a time. To solve this, the authors developed a gating mechanism that selects the appropriate expert model based on the test sample. This method, called “Expert Gate,” differs from approaches that train one large network for all tasks. For instance, a drone might need different models for flying in different environments, and the gating mechanism can choose the right model based on the input video.
Choosing the correct expert model isn’t easy, as neural networks can sometimes give misleadingly high confidence scores. They ruled out training a classifier to distinguish between tasks because it would require storing all the training data. What’s needed is a way to recognize the relevance of a task model for a specific test sample, which their gating mechanism does. This approach is similar to how the prefrontal cortex in the primate brain uses neural representations of task context to control different brain functions.
The authors propose recognizing tasks with an autoencoder that acts as a gate. For each new task or domain, they train a gating function that captures the shared characteristics of that task’s training samples and can recognize similar samples at test time. This gate is implemented as a one-layer undercomplete autoencoder, trained alongside the expert model for each task, which compresses the task’s data into a lower-dimensional space.
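To make this concrete, here is a minimal PyTorch sketch of what such a per-task gating autoencoder could look like. It assumes the gate operates on fixed-length feature vectors; the layer sizes, activation, and class name are illustrative choices, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn


class GateAutoencoder(nn.Module):
    """One-layer undercomplete autoencoder used as a task gate (illustrative sizes)."""

    def __init__(self, in_dim: int = 4096, code_dim: int = 100):
        super().__init__()
        # Encoder compresses the input features into a small code (undercomplete).
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        # Decoder tries to reconstruct the original features from the code.
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

    def reconstruction_error(self, x: torch.Tensor) -> torch.Tensor:
        # Per-sample mean squared error between input and reconstruction.
        return ((self(x) - x) ** 2).mean(dim=1)
```

Training such a gate simply means minimizing the reconstruction error on the new task’s data, so no data from previous tasks needs to be kept around.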
The main hypothesis is that the autoencoder of one domain/task should thus be better at reconstructing the data of that task than the other autoencoders.
At test time, each autoencoder projects the sample into its lower-dimensional space and measures how much information is lost in the process. The autoencoder with the smallest loss of information (reconstruction error) acts like a switch, selecting the corresponding expert model for that sample.
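Here is a sketch of that selection step, assuming one gating autoencoder and one expert network per task. The dictionary layout, the temperature, and the softmax-based confidence are illustrative choices in the spirit of the paper, not its exact implementation.

```python
import torch


def select_expert(x, gate_autoencoders, experts, temperature=2.0):
    """Route a test sample to the expert whose gate reconstructs it best.

    gate_autoencoders / experts: dicts mapping a task name to its gate and expert.
    The temperature and the softmax confidence are illustrative, not the paper's values.
    """
    task_names = list(gate_autoencoders.keys())
    with torch.no_grad():
        # Per-task reconstruction error of the sample's feature vector.
        errors = torch.stack(
            [((gate_autoencoders[t](x) - x) ** 2).mean() for t in task_names]
        )
    # Lower error -> higher confidence that the sample belongs to that task.
    confidences = torch.softmax(-errors / temperature, dim=0)
    best_task = task_names[int(errors.argmin())]
    return best_task, experts[best_task], confidences
```

Only the small gating autoencoders need to stay in memory at all times; the selected expert can then be loaded on demand, which is what makes the approach practical on a memory-limited GPU.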
The autoencoders also help estimate how related different tasks are during training. This information tells the system which previous expert to start from when learning a new task, and whether to simply fine-tune that model or to use a technique called learning without forgetting (LwF).
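A rough sketch of how such a decision could be made is below. The relatedness formula (comparing a previous gate’s reconstruction error on the new task’s validation data against the new gate’s own error) and the threshold for switching between fine-tuning and learning without forgetting are assumptions for illustration, not the paper’s exact numbers.

```python
def task_relatedness(err_new_gate: float, err_prev_gate: float) -> float:
    """Score how related a previous task is to the new one (illustrative sketch).

    err_new_gate:  reconstruction error of the new task's own gate on new-task validation data.
    err_prev_gate: reconstruction error of a previous task's gate on the same data.
    A previous gate that reconstructs the new data almost as well as the new gate
    scores close to 1; larger gaps give lower scores.
    """
    return 1.0 - (err_prev_gate - err_new_gate) / err_new_gate


def choose_transfer_strategy(relatedness_scores: dict, threshold: float = 0.85):
    """Pick the most related previous expert and a transfer method (threshold is illustrative)."""
    best_task = max(relatedness_scores, key=relatedness_scores.get)
    # Closely related previous task -> learning without forgetting; otherwise plain fine-tuning.
    if relatedness_scores[best_task] >= threshold:
        method = "learning_without_forgetting"
    else:
        method = "fine_tuning"
    return best_task, method
```

Either way, the selected previous expert provides the starting weights for the new expert, so knowledge is transferred without revisiting old data.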
In summary, their contributions are as follows: They developed Expert Gate, a system for lifelong learning that handles new tasks sequentially without needing all the previous data. It automatically picks the most related previous task to help learn the new one. When testing, it loads the right model for the task at hand. They tested this system on image classification and video prediction problems.
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.