Life-Long Learning — Part a
In this blog post, I want to write about life-long learning and what it actually is. The main resource is the “Stanford CS330 Deep Multi-Task & Meta-Learning” course, primarily lecture 15. The content of this blog post is mostly from the lecture, with some parts from me, and some possibly from an AI writing assistant.
Here are other posts from this series:
Let’s first talk about two kinds of problem statements: Multi-Task Learning and Meta-Learning.
In multitask learning, our goal is to solve a fixed set of tasks: the training tasks and the test tasks are ultimately the same set.
Whereas in meta-learning, our goal is to quickly learn a new task after having experience with a set of training tasks.
In many real-world scenarios, tasks do not arrive in a neatly organized batch; rather, they are presented sequentially, one after another. This presents a distinct set of challenges, as it diverges from situations where a large dataset or assortment of tasks is made available all at once. In these cases, it becomes crucial to draw upon prior experience in a more dynamic, sequential manner.
Consider the educational setting as an example. Students learn concepts incrementally over time, following a curriculum. Tasks or learning objectives may become progressively more complex, often building on prior knowledge. This isn’t a context where all information is dumped on learners simultaneously in a massive batch.
Similarly, an image classification system learning from a continuous stream of user-submitted images embodies this challenge. Different users may join the platform and start uploading images at varying times. Here, data gathering is a temporally stretched process, as opposed to acquiring a complete dataset from a set of users all at once.
Yet another example would be a robot navigating diverse environments while aiming to acquire an expanding range of skills. As it encounters new challenges or environments, the robot must adapt and learn on-the-fly, without being limited by the assumption of a static batch of settings to work within.
Virtual assistants also operate in a setting of constant flux. These systems must adapt to assisting different users, each presenting unique tasks at various times. The assistant must be equipped to handle each new situation, irrespective of past experiences.
Lastly, consider the role of a medical assistant in aiding in healthcare decisions. In this scenario, cases emerge sequentially rather than being presented en masse. Like in the examples above, the ability to sequentially integrate new information and adapt accordingly becomes critical.
So, whether it’s in education, image classification, robotics, or healthcare, the need to operate effectively in a sequential task environment is a common and pressing challenge.
Sequential learning refers to the process of learning tasks or information as they come in over time. This concept is related to but different from various other terms like online learning, lifelong learning, continual learning, incremental learning, and streaming data settings. These terms are often used interchangeably, leading to confusion and a lack of clear definitions. Sequential learning is distinct from “sequential data,” where data points have an inherent order, and “sequential decision-making,” where multiple decisions are made in a series over time. In sequential learning, the focus is on learning from new data as it arrives.
Let’s assume we have a problem setting and we want to answer the following three questions:
1- How would you set up an experiment to develop and test an algorithm for that setting?
2- What are the desired or required properties of an algorithm for that particular problem?
3- How would you actually go about evaluating such a system?
By thinking about the example tasks in the image below, students came up with some answers.
Across different examples, the general focus is often on good performance on both past and future tasks. However, in some cases, the concern isn’t really about performance on past tasks but solely on future and later tasks. In terms of model development, in some instances, the aim might be to learn a single model that can essentially perform all tasks. In other cases, the emphasis could be more on adaptability.
For example, in the virtual assistant scenario, different people have different objectives. Some may want an assistant that is very efficient, while others may care more about thoroughness. In such cases, adaptability might be more critical than trying to learn a single model that can handle all tasks in sequence.
Several other factors can vary between problem statements in the realm of task-oriented learning. For instance, tasks could be independent and identically distributed (IID), follow a predictable pattern over time, or be organized in a curriculum. Another variation not explicitly mentioned in earlier examples is an adversarial sequence of tasks. In scenarios such as spam detection, the incoming tasks or data points might be intentionally deceptive, designed to thwart filtering systems.
Additionally, tasks can differ in how they present themselves: either through discrete boundaries, like encountering a new user, or through more continuous shifts over time. While the discussion may not have included numerous examples of continuous shifts, they are common in practice. Changes in public sentiment on certain issues or seasonal variations are examples of such shifts.
Finally, tasks may or may not come with clear delineations or shifts. Knowing or not knowing these boundaries can impact the performance and suitability of different algorithms.
Beyond raw model performance, several other factors come into play. Data efficiency, for example, relates to the system’s ability to learn new tasks more efficiently over time. Computational resources are another concern: the goal might be to learn each new task using fewer computational resources than before.
Memory limitations also impact system performance. Over time, it might not be possible to store all incoming data, making it important to choose algorithms that don’t require an increase in memory storage as the number of tasks grows. This becomes even more critical when dealing with sensitive data, such as medical records, that may not be legally or ethically storable for long periods.
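One standard way to keep memory bounded under such constraints is reservoir sampling, which maintains a uniform random sample of fixed size k from an arbitrarily long stream using only O(k) memory. Here is a minimal sketch; the integer stream is just a placeholder for real stored examples:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream, in O(k) memory."""
    rng = random.Random(seed)
    buffer = []
    for t, item in enumerate(stream):
        if len(buffer) < k:
            buffer.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(t + 1)     # item survives with probability k/(t+1)
            if j < k:
                buffer[j] = item
    return buffer

sample = reservoir_sample(range(1_000_000), k=10)
print(sample)
```

The memory footprint never grows with the number of tasks or data points, which is exactly the property discussed above; whether a uniform sample is the right subset to keep depends on the problem.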
Evaluation metrics can thus include computational resource usage and memory footprint, alongside considerations like data interpretability, fairness, and decision-making speed at test time.
The point of considering all these aspects is to acknowledge that different problem settings have varying priorities and constraints. Therefore, some algorithms may be better suited for certain problem settings than others.
Furthermore, numerous issues arise regarding data and privacy, and these largely depend on the type of model in use. Some machine learning models possess explicit memory mechanisms that effectively store information, while others have high capacity, making it feasible to reverse-engineer the original data. However, there are also instances where extracting the data from the model isn’t possible. Generally, if privacy stands as a major concern, opting for a method that prevents data extraction from the model is advisable.
A substantial body of literature exists on the topic of differential privacy, which outlines techniques for training models with formal guarantees that bound how much any individual data point can be recovered from the model. Additionally, research indicates that when a model is trained on a sequence of data that varies over time, the model may actually forget older data and consequently perform poorly on it.
Let’s now see what can be a general online learning (lifelong learning) problem statement:
The problem of general online or lifelong learning can be formalized in a fairly simple way. Imagine that you have a certain amount of time, and you are progressing through it. At each point in time, you observe an input, such as an image or a patient record. Your task is to predict the label at that specific point in time. After making the prediction, you then get to see the true label. This process repeats over time.
This is a fairly typical formalization of the problem, but it doesn’t cover all possible formulations. For example, in some problem statements you may not observe the label right away: there might be a time delay of an hour, a month, or even a year between making a prediction for a data point and observing its label. It may also be that you don’t label all of your data, in which case you cannot observe labels for every point. Still, this statement is general yet simple enough that you can reason about it tractably, and there has been a lot of theoretical work on this problem statement.
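The observe-predict-reveal loop above can be sketched in a few lines of Python. The protocol function is generic; the `MajorityClassLearner` baseline and the synthetic stream are hypothetical stand-ins for a real model and data source:

```python
import random

def online_protocol(learner, stream):
    """Run the online loop: observe x_t, predict, then see the true y_t."""
    mistakes = 0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)   # prediction is made before the label is revealed
        if y_hat != y_t:
            mistakes += 1
        learner.update(x_t, y_t)       # learner adapts after seeing the true label
    return mistakes

class MajorityClassLearner:
    """Toy baseline (hypothetical): predict the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x_t):
        if not self.counts:
            return 0  # arbitrary default before any label is observed
        return max(self.counts, key=self.counts.get)
    def update(self, x_t, y_t):
        self.counts[y_t] = self.counts.get(y_t, 0) + 1

rng = random.Random(0)
stream = [(rng.random(), int(rng.random() < 0.8)) for _ in range(200)]
print(online_protocol(MajorityClassLearner(), stream))
```

The cumulative mistake count returned here is the quantity that online-learning theory typically analyzes, usually relative to the best fixed predictor in hindsight (regret).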
There are a couple of different variants of this problem setting. On the input side, you could have an IID setting where x and y come from a stationary distribution that is not a function of time, or a setting where these distributions change over time. There’s also what’s typically referred to as a streaming setting, where you’re not allowed to store your data points at all: you only receive a stream. One reason this setting arises is that you might receive so much data that storing it is simply impossible; a very large video platform, for example, may not be able to store every frame it receives. Beyond lack of memory, you could also face limited computational resources or privacy constraints: you may not be allowed to store the data, or you may want to offer a service that promises users their data isn’t stored.
Another reason the streaming setting can be interesting is if you’re studying analogs to the brain. We don’t have hard drives we can access the way a computer does, and some work tries to understand and model how memory might work in the brain. In that sort of setting, you often want the model to have no external storage at all; it instead needs to somehow store what it learns in its weights.
The streaming setting does hold in some cases, and there are intermediate cases too: maybe you don’t have zero memory, but only enough to store a small amount of data. In many other cases it is actually quite practical to store a ton of data; hard drives are getting pretty big these days, and we can often store data as part of the process.
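To make the zero-storage constraint concrete, here is a sketch of a learner that keeps no data at all: each sample updates the parameters once and is then discarded, so memory stays constant no matter how long the stream runs. The 1-D linear target y = 2x + 1 and the learning rate are illustrative choices:

```python
def sgd_step(w, b, x, y, lr=0.01):
    """One constant-memory SGD update for 1-D linear regression.
    The sample (x, y) is used exactly once and never stored."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

w, b = 0.0, 0.0
for t in range(20000):          # simulate a long stream, one sample at a time
    x = (t % 100) / 100.0       # hypothetical input stream
    y = 2.0 * x + 1.0           # hypothetical noiseless target
    w, b = sgd_step(w, b, x, y)
print(w, b)
```

After the stream, (w, b) has converged close to (2, 1) even though no sample was ever retained: everything the learner knows is stored in its weights, which is exactly the regime the brain-analogy work above is interested in.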
If you are seeing data points from a sequence of tasks and the task boundaries are obvious, then in addition to observing x_t, you can also observe the task identifier z_t that corresponds to that task.
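When z_t is observable, one simple way to use it is to condition the learner on it, for example by keeping separate statistics or parameters per task. The sketch below uses a per-task running mean of the labels as a hypothetical stand-in for a per-task model:

```python
from collections import defaultdict

class TaskConditionedLearner:
    """Toy learner with known task boundaries: one 'model' per task id z_t
    (here just a running mean of the labels seen for that task)."""
    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)
    def predict(self, x_t, z_t):
        if self.count[z_t] == 0:
            return 0.0  # no data observed for this task yet
        return self.total[z_t] / self.count[z_t]
    def update(self, x_t, y_t, z_t):
        self.total[z_t] += y_t
        self.count[z_t] += 1

learner = TaskConditionedLearner()
for y in [1.0, 1.0, 1.0]:
    learner.update(None, y, z_t="task_a")   # hypothetical task ids
for y in [5.0, 5.0]:
    learner.update(None, y, z_t="task_b")
print(learner.predict(None, "task_a"), learner.predict(None, "task_b"))
```

Because updates for one task never touch another task’s statistics, this design cannot forget earlier tasks, though it also shares nothing between them; real methods sit somewhere between full separation and a single shared model.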
That’s enough for this post. We’ll continue in the next blog post.
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.