Do Chatbots Just Need More Time to Think?
www.scientificamerican.com
January 23, 2025

A technique called test-time compute can improve how AI responds to some hard questions, but it comes at a cost

By Lauren Leffer, edited by Ben Guarino

Image credit: Moor Studio/Getty Images

Technology trends almost always prioritize speed, but the latest fad in artificial intelligence involves deliberately slowing chatbots down. Machine-learning researchers and major tech companies, including OpenAI and Google, are shifting focus away from ever larger model sizes and training datasets and toward something called test-time compute.

This strategy is often described as giving AI more time to think or reason, though these models work more rigidly than human brains do. It's not as though an AI model is granted new freedoms to mull over a problem. Instead test-time compute introduces structured interventions in which computer systems are built to double-check their work through intermediate calculations or extra algorithms applied to their final responses. It's more akin to making an exam open-book than it is to simply extending a time limit.

Another name for the newly popular AI-improvement strategy (which has actually been around for a few years) is inference scaling. Inference is the process by which a previously trained AI crunches through new data to perform a freshly prompted task, whether that's generating text or flagging spam e-mails. By allowing additional seconds or minutes to elapse between a user's prompt and the program's response, and by providing extra computational power at the program's critical moment of inference, some AI developers have seen a dramatic jump in the accuracy of chatbot answers.

Test-time compute is especially helpful for quantitative questions. "The places we've seen the most exciting improvements are things like code and math," says Amanda Bertsch, a fourth-year computer science Ph.D. student at Carnegie Mellon University, where she studies natural language processing. Bertsch explains that test-time compute offers the most benefit when there's an objectively correct response or a measurable way of determining better or worse.

OpenAI's recently released o1, its latest publicly available model powering ChatGPT-style bots, is much better at writing computer code and correctly answering math and science queries than its predecessors, the company claims: a recent blog post describes o1 as up to eight times more accurate in responding to prompts used in programming competitions and nearly 40 percent more accurate at answering Ph.D.-level physics, biology and chemistry questions. OpenAI attributes these improvements to test-time compute and related strategies. And a follow-up model called o3 (still undergoing safety testing and planned for release later this month) is nearly three times as accurate as o1 in responding to certain reasoning questions, says Lindsay McCallum Rémy, a communications officer at OpenAI.

Other academic analyses, most released as preprint studies that have not yet been peer-reviewed, have reported similarly impressive results. Test-time compute could improve AI accuracy and its capacity to tackle complex reasoning problems, says Aviral Kumar, an assistant professor of computer science and machine learning at Carnegie Mellon University. He's excited about his field's shift toward this strategy because it grants machines the same grace we give people when they take an extra beat to tackle tough questions.
He thinks this could bring us closer to models with humanlike intelligence.

Even if it doesn't, test-time compute offers a practical alternative to prevailing methods of improving large language models, or LLMs. The costly, brute-force approach of building ever larger models and training them on increasingly massive datasets is now offering diminishing returns. Bertsch says test-time compute has proven its worth in making consistent performance gains, without inflating already-unwieldy models or forcing developers to scrounge additional high-quality data from a dwindling supply. Yet increasing test time can't solve everything; it has its own trade-offs and limits.

A Big Umbrella

AI developers have multiple ways to adjust the test-time compute process and thus improve model outputs. "It's a really broad set of things," Bertsch says, "pretty much anything where you're treating a model like part of a system and building scaffolding around it."

The most rudimentary method is something anyone with a computer can do at home: asking a chatbot to produce many responses to a single question. Generating more answers requires more time, which means the inference process takes longer. One way to think about it: the user becomes a layer of human scaffolding that guides the model to the most accurate, or best-suited, answer.

Another basic method involves prompting a chatbot to report the intermediate steps it takes to solve a problem. Called chain-of-thought prompting, this strategy was formally outlined in a 2022 preprint paper by Google researchers.
Similarly, a user can also simply ask an LLM to double-check or improve an output after it has been generated.

Some assessments indicate that chain-of-thought prompting and related self-correction methods improve model outputs, though other research demonstrates that these strategies are unreliable, prone to producing the same sorts of hallucinations as other chatbot outputs. To reduce unreliability, many test-time strategies use an external verifier: an algorithm trained to grade model outputs, based on preset criteria, and to select the output that offers the best step toward a specific goal.

Verifiers can be applied after a model has generated a list of possible responses. When an LLM generates computer code, for example, a verifier could be as simple as a program that runs the code to make sure it works. Other verifiers might guide a model through each juncture of a multistep problem. Some versions of test-time compute combine the logic of these approaches by using verifiers that evaluate a model's output in both ways: as a stepwise process, with many possible branching paths, and as a final response. Other systems use verifiers to find errors in a chatbot's initial output or chain of thought and then give the LLM feedback to correct those problems.

Test-time compute is so successful for quantitative problems because all verifiers hinge on the existence of a knowable, correct answer (or at least an objective basis for comparing two options), Bertsch says.
The strategy is less effective for improving outputs such as poems or translations, in which ranking is subjective.

In a slight departure from all of the above, machine-learning developers can also use the same sorts of algorithms to hone a model during development and training and then apply them during test time.

"Right now we have all of these different techniques, all of which have in common that you just do extra computation at test time and which share basically no other technical features," says Jacob Andreas, an associate professor of computer science at the Massachusetts Institute of Technology. "It seems like they all make models a little bit better. And we really don't understand what the relationships are between them."

Shared Limits

Although the methods vary, they share the same inherent limitations: slower generation speeds and the potential need for more computational resources, water and energy. Environmental sustainability is already a growing issue for the field.

It might take about five seconds for an LLM to answer a single query without any added test-time compute, says Ekin Akyürek, a computer science Ph.D. candidate at M.I.T., who is advised by Andreas. But a method developed by Akyürek, Andreas and their colleagues raises that response time to five minutes.

For certain applications and prompts, increasing how long inference takes simply doesn't make sense, says Dilek Hakkani-Tur, a professor of computer science at the University of Illinois Urbana-Champaign. Hakkani-Tur has worked extensively on developing AI conversational agents that speak to users, such as Amazon's Alexa. There, speed is of utmost importance, she says. For complicated interactions, a user might not mind a few seconds' pause for a bot's response. But for a basic back-and-forth, a human might disengage if they must wait for what feels like an unnaturally long time.

More time also means more computational effort and money.
Having o3 perform a single task could cost OpenAI either $17 or more than $1,000, depending on the version of the software that is used, according to estimates from the creator of a popular AI benchmarking test, who was granted early access to the AI. And in cases where a model will be queried millions of times by a large user base, shifting the computational investment from training to inference would make all those prompts quickly add up to a major financial burden and a massive energy suck. Querying an LLM such as ChatGPT already uses an estimated 10 times the power of a Google search. Going from five seconds of computation to five minutes increases in-the-moment energy demand dozens of times over, Akyürek says.

But this isn't a definite downside in every case. If boosting test-time compute allows smaller models to perform better with less training, or if it eliminates the need to keep building and training more models from the ground up, then the strategy could potentially lessen generative AI's energy consumption in some instances, Hakkani-Tur says. The final balance depends on factors such as the intended use, the frequency with which a model is queried and the question of whether the model is small enough to be run on a local device instead of a distant server stack. "The pros and cons need to be carefully computed," she adds. "I would look at the bigger picture of how I am going to use a model." That is to say, AI developers should think long and hard before encouraging their creations to do the same.
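
For readers who want to see the verifier idea in concrete terms, here is a minimal Python sketch of the simplest recipe described above: generate several answers, then let a program that runs the code pick the one that works. It is an illustration under stated assumptions, not any company's actual system. The candidate answers are hard-coded stand-ins for what a chatbot might return, and the names `CANDIDATES`, `TEST_CASES`, `verifier_score` and `best_of_n` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: these strings stand in for several chatbot responses to the
# prompt "write a function add(a, b) that returns the sum". No model is called.
CANDIDATES: List[str] = [
    "def add(a, b):\n    return a - b",  # runs, but gives wrong answers
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b)\n    return a + b",   # missing colon: will not even compile
]

# Known input/output pairs. These play the role of the "objectively correct
# response" that a verifier needs in order to grade anything.
TEST_CASES = [((1, 2), 3), ((-4, 4), 0), ((10, 5), 15)]


def verifier_score(source: str) -> float:
    """Run one candidate and return the fraction of test cases it passes."""
    namespace: dict = {}
    try:
        # Acceptable for a toy example; real systems sandbox untrusted code.
        exec(source, namespace)
        func = namespace["add"]
    except Exception:
        return 0.0  # code that fails to load scores zero
    passed = 0
    for args, expected in TEST_CASES:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one test simply earns no credit for it
    return passed / len(TEST_CASES)


def best_of_n(candidates: List[str]) -> Tuple[Optional[str], float]:
    """Score every candidate and keep the highest-scoring one."""
    best: Optional[str] = None
    best_score = -1.0
    for source in candidates:
        score = verifier_score(source)
        if score > best_score:
            best, best_score = source, score
    return best, best_score


if __name__ == "__main__":
    chosen, score = best_of_n(CANDIDATES)
    print(f"verifier score: {score}")
    print(chosen)
```

Every additional candidate means another round of generation and another verifier run, which is the time, cost and energy trade-off the researchers describe. The sketch also shows why the approach suits code and math: without the known answers in `TEST_CASES`, the verifier would have nothing to grade against.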