"Once you stop learning, you start dying." Albert EinsteinSource: WikipediaThe pace of improvement in artificial intelligence today is breathtaking.An exciting new paradigmreasoning models based on inference-time computehas emerged in recent months, unlocking a whole new horizon for AI capabilities.The feeling of a building crescendo is in the air. AGI seems to be on everyones lips.Systems that start to point to AGI are coming into view, wrote OpenAI CEO Sam Altman last month. The economic growth in front of us looks astonishing, and we can now imagine a world where we cure all diseases and can fully realize our creative potential.Or, as Anthropic CEO Dario Amodei put it recently: What I've seen inside Anthropic and out over the last few months has led me to believe that we're on track for human-level AI systems that surpass humans in every task within 23 years.Yet todays AI continues to lack one basic capability that any intelligent system should have.Many industry participants do not even recognize that this shortcoming exists, because the current approach to building AI systems has become so universal and entrenched. But until it is addressed, true human-level AI will remain elusive.What is this missing capability? The ability to continue learning.What do we mean by this?Todays AI systems go through two distinct phases: training and inference.First, during training, an AI model is shown a bunch of data from which it learns about the world. Then, during inference, the model is put into use: it generates outputs and completes tasks based on what it learned during training.All of an AIs learning happens during the training phase. After training is complete, the AI models weights become static. Though the AI is exposed to all sorts of new data and experiences once it is deployed in the world, it does not learn from this new data.In order for an AI model to gain new knowledge, it typically must be trained again from scratch. In the case of todays most powerful AI models, each new training run can take months and cost hundreds of millions of dollars.Take a moment to reflect on how peculiarand suboptimalthis is. Todays AI systems do not learn as they go. They cannot incorporate new information on the fly in order to continuously improve themselves or adapt to changing circumstances.In this sense, artificial intelligence remains quite unlike, and less capable than, human intelligence. Human cognition is not divided into separate training and inference phases. Rather, humans continuously learn, incorporating new information and understanding in real-time. (One could say that humans are constantly and simultaneously doing both training and inference.)What if we could eliminate the kludgy, rigid distinction in AI between training and inference, enabling AI systems to continuously learn the way that humans do?This basic concept goes by many different names in the AI literature: continual learning, lifelong learning, incremental learning, online learning.It has long been a goal of AI researchersand has long remained out of reach.Another term has emerged recently to describe the same idea: test-time training.As Perplexity CEO Aravind Srinivas said recently: Test-Time Compute is currently just inference with chain of thought. We havent started doing test-time-training - where model updates weights to go figure out new things or ingest a ton of new context, without losing generality and raw IQ. 
Fundamental research problems remain to be solved before continual learning is ready for primetime. But startups and research labs are making exciting progress on this front as we speak. The advent of continual learning will have profound implications for the world of AI.

Workarounds and Half-Solutions

It is worth noting that a handful of workarounds exist to mitigate AI's current inability to learn continuously. Three in particular are worth mentioning. While each of these can help, none fully solves the problem.

The first is model fine-tuning. Once an AI model has been pretrained, it can subsequently be fine-tuned on a smaller amount of new data in order to incrementally update its knowledge base. In principle, fine-tuning a model on an ongoing basis could be one way to enable an AI system to incorporate new learnings as it goes.

However, periodically fine-tuning a model is still fundamentally a batch-based rather than a continuous approach; it does not unlock true on-the-fly learning. And while fine-tuning a model is less resource-intensive than pretraining it from scratch, it is still complex, time-consuming and expensive, making it impractical to do too frequently. Perhaps most importantly, fine-tuning only works well if the new data does not stray too far from the original training data. If the data distribution shifts dramatically (for instance, if a model is presented with a totally new task or environment that is unlike anything it has encountered before), then fine-tuning can fall prey to the foundational challenge of catastrophic forgetting (discussed in more detail below).

The second workaround is to combine some form of retrieval with some form of external memory: for instance, retrieval-augmented generation (RAG) paired with a dynamically updated vector database. Such AI systems can store new learnings on an ongoing basis in a database that sits outside the model and then pull information from that database when needed. This can be another way for an AI model to continuously incorporate new information. (A minimal sketch of this pattern appears at the end of this section.)

But this approach does not scale well. The more new learnings an AI system accumulates, the more unwieldy it becomes to store and retrieve all of this new information in an efficient way using an external database. Latency, computational cost, retrieval accuracy and system complexity all limit the usefulness of this approach.

A final way to mitigate AI's inability to learn continuously is in-context learning. AI models have a remarkable ability to update their behavior and knowledge based on information presented to them in a prompt and included within their current context window. The model's weights do not change; rather, the prompt itself is the source of learning. This is referred to as in-context learning. It is in-context learning that, for example, makes possible the practice of prompt engineering.

In-context learning is elegant and efficient. It is also, however, ephemeral. As soon as the information is no longer in the context window, the new learnings are gone: for instance, when a different user starts a session with the same AI model, or when the same user starts a new session with the model the next day. Because the model's weights have not changed, its new knowledge does not persist over time. This severely limits the usefulness of in-context learning in enabling true continual learning.
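Of the three workarounds, the external-memory approach is the easiest to make concrete in code. The sketch below is purely illustrative: new "learnings" are written to a small vector store outside the model and retrieved by similarity at query time. The `embed()` helper is a placeholder stand-in for a real embedding model, and the class and method names are our own invention, not any vendor's API.

```python
# Illustrative external-memory workaround: knowledge lives in a store outside the model.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding (hash-seeded random unit vector); a real system
    # would call an actual embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class ExternalMemory:
    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text: str) -> None:
        # "Learning" here is just writing a row to the database.
        self.texts.append(text)
        self.vectors.append(embed(text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Cosine similarity against every stored vector (unit vectors, so a dot product).
        sims = np.array(self.vectors) @ embed(query)
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

memory = ExternalMemory()
memory.add("The client prefers weekly status reports.")
memory.add("Contract renewals are negotiated every March.")
context = memory.retrieve("When do we renegotiate contracts?", k=1)
# The retrieved context is prepended to the prompt; the model's weights never change.
```

Nothing in this loop ever touches model weights, and as the store grows, retrieval latency, cost and accuracy become the limiting factors, which is the scaling problem described above.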
Moats, Moats, Moats

One important reason why continual learning represents such a tantalizing possibility: it could create durable moats for the next generation of AI applications. How would this work?

Today, OpenAI's GPT-4o is the same model for everyone that uses it. It doesn't change based on its history with you (although ChatGPT, the product, does incorporate some elements of persistent memory). This makes it frictionless for users to switch between OpenAI, Anthropic, Google, DeepSeek and so on. Any of these companies' models will give you more or less the same response to a given prompt, whether you've had thousands of previous interactions with it or you are trying it for the first time. Little wonder that the conventional wisdom today is that AI models inevitably commoditize.

In a continual learning regime, by contrast, the more a user uses a model, the more personalized the model becomes. As you work with a model day in and day out, the model becomes more tailored to your context, your use cases, your preferences, your environment. Its neurons literally get rewired as it learns about you and about the things that matter to you. It gets to know you.

Imagine how much more compelling a personal AI agent would be if it reliably adapted to your particular needs and idiosyncrasies in real-time, thereby building an enduring relationship with you. (For a dramatized illustration of what continual learning might look like, and how different this would be from today's AI, think of the Samantha character in the 2013 film Her.)

The impact of continual learning will be enormous in both consumer and enterprise settings. A lawyer using a legal AI application will find that, after a few months of using the application, it has a much deeper understanding than it did at the outset about the lawyer's roster of clients, how she engages with different colleagues, how she likes to craft legal arguments, when she chooses to push back on clients versus acquiesce to their preferences, and so forth. A recruiter will find that, the more he uses an AI product, the more intuitively it understands which candidates he tends to prioritize, how he likes to conduct screening interviews, how he writes job descriptions, how he engages in compensation negotiations, and so on. Ditto for AI products for accountants, for doctors, for software engineers, for product designers, for salespeople, for writers, and beyond.

Continual learning will enable AI to become personalized in a way that it has never been before. This will make AI products sticky in a way that they have never been before. After you've worked with it for a while, your AI model will be very different from someone else's version or the off-the-shelf version of the same model. Its weights will have adapted to you. This will make it painful and inconvenient to switch to a competing product, in the same way that it is painful and inconvenient to replace a well-trained, high-performing employee with someone who is brand new.

Venture capitalists like to obsess over moats: durable sources of competitive advantage for companies. It remains an open question what the most important new moats will be in the era of AI, particularly at the application layer. A long-standing narrative about moats in AI relates to proprietary data.
According to this narrative, the more user data an AI product collects, the better and more differentiated the product becomes as it learns from that data, and the deeper the moat therefore gets. This story makes intuitive sense and is widely repeated today.

However, the extent to which collecting additional user data has actually led to product differentiation and moats in AI remains limited to date, precisely because AI systems do not actually learn and adapt continuously based on new data. How much lock-in do you as a user experience today with Perplexity versus ChatGPT versus Claude as a result of user-level personalization in those products?

Continual learning will change this. It will, for the first time, unleash AI's full potential to power hyperpersonalized and hypersticky AI products. It will create a whole new kind of moat for the AI era.

Continual Learning's Achilles' Heel

The potential upsides of continual learning are enormous. It would unlock whole new capabilities and market opportunities for AI. The idea of continual learning is not new. AI researchers have been talking about it for decades. So: why are today's AI systems still not capable of learning continuously?

One fundamental obstacle stands in the way of building AI systems that can learn continuously: an issue known as catastrophic forgetting. Catastrophic forgetting is simple to explain and fiendishly difficult to solve. In a nutshell, catastrophic forgetting refers to neural networks' tendency to overwrite and lose old knowledge when they add new knowledge.

Concretely, imagine an AI model whose weights have been optimized to complete task A. It is then exposed to new data related to completing task B. The central premise of continual learning is that the model's weights can update dynamically in order to learn to solve task B. By updating the weights to complete task B, however, the model's ability to complete task A inevitably degrades. (A toy demonstration of this effect appears a few paragraphs below.)

Humans do not suffer from catastrophic forgetting. Learning how to drive a car, for instance, does not cause us to forget how to do math. Somehow, the human brain manages to incorporate new learnings on an ongoing basis without sacrificing existing knowledge. As with much relating to the human brain, we don't understand exactly how it does this. For decades, AI researchers have sought to recreate this ability in artificial neural networks, without much success. The entire field of continual learning can be understood first and foremost as an attempt to solve catastrophic forgetting.

The core challenge here is to find the right balance between stability and plasticity. Increasing one inevitably jeopardizes the other. As a neural network becomes more stable and less changeable, it is in less danger of forgetting existing learnings, but it is also less capable of incorporating new learnings. Conversely, a highly plastic neural network may be well positioned to integrate new learnings from new data, but it does so at the expense of the knowledge that its weights had previously encoded.
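Returning to the task A / task B example above, the toy script below (PyTorch, purely illustrative, with made-up functions as the two "tasks") shows catastrophic forgetting directly: a small network is fit to one function, then trained only on a second one, and its error on the first climbs back up because the same shared weights are overwritten.

```python
# Toy demonstration of catastrophic forgetting on two regression tasks.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(-1, 1, 200).unsqueeze(1)
task_a = torch.sin(3 * x)        # task A: fit sin(3x)
task_b = x ** 2                  # task B: fit x^2

def fit(target, steps=500):
    for _ in range(steps):
        loss = nn.functional.mse_loss(net(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

fit(task_a)
loss_a_before = nn.functional.mse_loss(net(x), task_a).item()   # low: task A learned

fit(task_b)                                                      # now train only on task B
loss_a_after = nn.functional.mse_loss(net(x), task_a).item()     # much higher: task A forgotten

print(f"task A loss before: {loss_a_before:.4f}, after learning task B: {loss_a_after:.4f}")
```

The second fit never sees task A data, so nothing stops the optimizer from repurposing the weights that encoded it.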
Existing approaches to continual learning can be grouped into three main categories, each of which seeks to address catastrophic forgetting by striking the right balance between stability and plasticity.

The first category is known as replay, or rehearsal. The basic idea behind replay-based methods is to hold on to and revisit samples of old data on an ongoing basis while learning from new data, in order to prevent the loss of older learnings. The most straightforward way to accomplish this is to store representative data points from previous tasks in a memory buffer and then to intersperse those old data with new data when learning new things. A more complex alternative is to train a generative model that can produce synthetic data that approximates the old data and then use that model's output to replay previous knowledge, without needing to actually store earlier data points.

The core shortcoming of replay-based continual learning methods is that they do not scale well (for a similar reason as RAG-based methods, described above). The more data a continual learning system is exposed to over time, the less practicable it is to hold on to and replay all of that previous data in a compact way.

The second main approach to continual learning is regularization. Regularization-based methods seek to mitigate catastrophic forgetting by introducing constraints into the learning process that protect existing knowledge: for example, by identifying model weights that are particularly important for existing knowledge and slowing the rate at which those weights can change, while enabling other parts of the neural network to update more freely. Influential algorithms that fall into this category include elastic weight consolidation (out of DeepMind), Synaptic Intelligence (out of Stanford) and Learning Without Forgetting (out of the University of Illinois). (A simplified sketch of the elastic weight consolidation penalty appears at the end of this section.)

Regularization-based methods can work well under certain circumstances. They break down, though, when the environment shifts too dramatically (i.e., when the new data looks totally unlike the old data), because their learning constraints prevent them from fully adapting. In short: too much stability, not enough plasticity.

The third approach to continual learning is architectural. The first two approaches assume a fixed neural network architecture and aim to assimilate new learnings by updating and optimizing one shared set of weights. Architectural methods, by contrast, solve the problem of incremental learning by allocating different components of an AI model's architecture to different realms of knowledge. This often includes dynamically growing the neural network by adding new neurons, layers or subnetworks in response to new learnings. One prominent example of an architectural approach to continual learning is Progressive Neural Networks, which came out of DeepMind in 2016.

Devoting different parts of a model's architecture to different kinds of knowledge helps mitigate catastrophic forgetting because new learnings can be incorporated while leaving existing parameters untouched. A major downside, though, is again scalability: if the neural network grows whenever it adds new knowledge, it will eventually become intractably large and complex.

While replay-based, regularization-based and architecture-based approaches to continual learning have all shown some promise over the years, none of these methods works well enough to enable continual learning at any scale in real-world settings today.
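As a flavor of the regularization family, here is a simplified sketch of the elastic weight consolidation (EWC) penalty mentioned above. It is our own minimal rendering of the published idea, not a reference implementation: weights that were important for the old task (high estimated Fisher importance) are anchored to their previous values by a quadratic penalty, while unimportant weights stay free to adapt.

```python
# Simplified EWC-style penalty: protect weights that mattered for the old task.
import torch

def ewc_penalty(model, old_params, fisher_diag, lam=100.0):
    """Returns (lam/2) * sum_i F_i * (theta_i - theta_i_old)^2 over all parameters.

    old_params: dict of parameter tensors saved after training on the old task.
    fisher_diag: dict of per-parameter importance estimates for the old task
                 (commonly the mean squared gradient of the old task's loss).
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on the new task, the total loss becomes:
#   loss = new_task_loss(model) + ewc_penalty(model, old_params, fisher_diag)
```

The coefficient `lam` is exactly the stability-plasticity dial described earlier: larger values protect old knowledge more aggressively, at the cost of the model's ability to adapt to the new task.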
Making Continual Learning A Reality

The past year, however, has seen an exciting new wave of progress in continual learning. The advent of generative AI and large language models is redefining what is possible in this field. Suddenly, it seems that AI models that can learn and adapt as they go may be around the corner.

A few leading AI startups are at the vanguard of this fast-moving field. Two worth highlighting are Writer and Sakana.

Writer is an enterprise AI platform with a long list of blue-chip Fortune 500 customers including Prudential, Intuit, Salesforce, Johnson & Johnson, Uber, L'Oréal and Accenture. Last November, Writer debuted a new AI architecture known as self-evolving models.

"These models are able to identify and learn new information in real time, adapting to changing circumstances without requiring a full retraining cycle," the Writer team wrote. "A self-evolving model has the capacity to improve its accuracy over time, learn from user behavior, and deeply embed itself in business knowledge and workflows."

How did Writer manage to build AI models that can learn continuously? How do the company's self-evolving models work?

As a self-evolving model is exposed to new information, it actively self-reflects in order to identify where it has knowledge gaps. If it makes a mistake or fails a task, it reflects on what went wrong and generates ideas for improvement. It then stores these self-generated insights in a short-term memory pool within each model layer.

Storing these learnings within the model's individual layers means that the model can instantly access and apply this information as it processes inputs, without needing to pause and query an external source. It also enables the information in the memory pool to directly shape the model's attention mechanism, making its responses more accurate and well-informed. And because the memory pools are fixed in size, they avoid the scalability challenges that plague earlier continual learning methods like replay. Rather than growing unmanageably large as the model accumulates more knowledge, these memory pools function like short-term scratchpads that are continuously updated.

Then, periodically, when the model determines that it has accumulated enough important learnings in its short-term memory, it autonomously updates its weights using reinforcement learning in order to more permanently consolidate those learnings. Specifically, Writer's self-evolving models use a reinforcement learning methodology known as group relative policy optimization, or GRPO, made popular by DeepSeek.

"Self-evolving models tackle catastrophic forgetting not by simply referencing the past, like replay-based methods, but by building a system that evolves gracefully: reflecting, remembering and adapting without losing its core," said Writer cofounder & CTO Waseem Alshikh. "It's not a total departure from continual learning's roots, but it's a fresh twist that leverages the latest in LLM self-improvement. This design reflects our belief that the future of AI isn't just about bigger models, but smarter, more adaptive ones. It's efficient and practical, especially for real-world applications where things change fast."

Writer's self-evolving models are live in deployment with customers today.
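Writer has not published implementation details or code for its self-evolving models, so the sketch below is only a speculative illustration of one ingredient described above: a fixed-size memory pool inside a layer whose slots are attended over alongside the regular tokens, and whose contents can be overwritten on the fly without growing the model. All names, shapes and design choices here are our own assumptions for illustration, not Writer's architecture.

```python
# Speculative, simplified illustration (NOT Writer's implementation): a layer with a
# fixed-size memory pool whose slots participate in attention and can be rewritten on the fly.
import torch
import torch.nn as nn

class LayerWithMemoryPool(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, pool_slots: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fixed-size scratchpad: it never grows, so it sidesteps the replay-style scaling problem.
        self.memory_pool = nn.Parameter(torch.zeros(1, pool_slots, d_model), requires_grad=False)

    @torch.no_grad()
    def write_memory(self, slot: int, insight: torch.Tensor) -> None:
        # "Learning on the fly": overwrite one slot with a new insight vector.
        self.memory_pool[0, slot] = insight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Tokens attend over themselves plus the memory slots, so stored insights
        # directly shape the attention pattern for the current input.
        mem = self.memory_pool.expand(x.size(0), -1, -1)
        kv = torch.cat([x, mem], dim=1)
        out, _ = self.attn(x, kv, kv)
        return out

layer = LayerWithMemoryPool()
tokens = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 tokens each
layer.write_memory(0, torch.randn(64))     # store a new "insight" without any retraining
output = layer(tokens)                     # the stored memory influences attention immediately
```

Because the pool has a fixed number of slots, writing a new insight eventually means overwriting an old one, which is what makes it behave like the short-term scratchpad described above rather than an ever-growing buffer.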
Another cutting-edge AI startup that is advancing the frontiers of continual learning is Sakana AI. Based in Japan, Sakana is an AI research lab founded by leading AI scientists from Google, including one of the co-inventors of the transformer architecture. In January, Sakana published new research on what it calls self-adaptive AI.

Sakana's new methodology, named Transformer² (Transformer Squared), enables AI models to dynamically adjust their weights in real-time depending on the task they are presented with.

"Our research offers a glimpse into a future where AI models are no longer static," wrote the Sakana research team. "These systems will scale their compute dynamically at test-time to adapt to the complexity of tasks they encounter, embodying living intelligence capable of continuous change and lifelong learning. We believe self-adaptivity will not only transform AI research but also redefine how we interact with intelligent systems, creating a world where adaptability and intelligence go hand in hand."

[Figure: Overview of Sakana's Transformer² architecture. Source: Sakana AI]

Transformer² works by first developing task-specific expert vectors within an AI model that are well-suited to handle different topics (e.g., a vector for math, a vector for coding, and so on). At inference time, the system follows a two-step process (hence the name Transformer²) to self-adapt depending on the context. First, the system determines in real-time which skills and knowledge (and thus which vectors) are most relevant given the current task. Second, the neural network dynamically amplifies some of its expert vectors and dampens others, modifying its base weights to tailor itself to its current situation.

The Transformer² method has some thematic overlap with architectural approaches to continual learning, discussed above, as well as with mixture-of-experts (MoE) systems; all of these approaches involve modular subsystems of experts housed within an AI model. Transformer² performs impressively on key benchmarks like GSM8K and ARC, besting popular fine-tuning approaches like LoRA by a wide margin while requiring fewer parameters.

In the words of Sakana research scientist Yujin Tang, who led this effort: "Transformer² is a lightweight, modular approach to adaptation. Unlike MoE, where experts emerge without explicit specialization, our method dynamically refines representations with true task-specific expertise. While not full continual learning yet, it's a crucial step toward AI that evolves in real-time without catastrophic forgetting."

Conclusion

Today's AI models are static. Once deployed, they do not change when they are presented with new information. This is a remarkable shortcoming for any intelligent system to have. It represents a profound weakness of artificial intelligence compared to biological intelligence.

But this is changing, quickly. At the frontiers of AI, researchers are developing new kinds of AI models that can learn and adapt throughout their lifetimes by continuously updating their weights. Whether you call this new paradigm self-evolving AI, or self-adaptive AI, or test-time training, or (the more traditional term) continual learning, it is one of the most exciting and important areas of research in AI today.

It is rapidly erasing the conventional divide between training and inference and opening up an entire new vista of capabilities for AI. It is also enabling new sources of moats and defensibility for AI-native startups. Continual learning will upend established assumptions and redefine what is possible in AI. Watch this space.

Note: The author is a partner at Radical Ventures, which is an investor in Writer.