Have researchers discovered a new AI scaling law? That's what some buzz on social media suggests, but experts are skeptical.

AI scaling laws, a bit of an informal concept, describe how the performance of AI models improves as the size of the datasets and computing resources used to train them increases. Until roughly a year ago, scaling up "pre-training" (training ever-larger models on ever-larger datasets) was the dominant law by far, at least in the sense that most frontier AI labs embraced it.

Pre-training hasn't gone away, but two additional scaling laws, post-training scaling and test-time scaling, have emerged to complement it. Post-training scaling is essentially tuning a model's behavior, while test-time scaling entails applying more computing to inference (i.e., running models) to drive a form of "reasoning" (see: models like R1).

Google and UC Berkeley researchers recently proposed in a paper what some commentators online have described as a fourth law: "inference-time search."

Inference-time search has a model generate many possible answers to a query in parallel and then select the best of the bunch. The researchers claim it can boost the performance of a year-old model, like Google's Gemini 1.5 Pro, to a level that surpasses OpenAI's o1-preview "reasoning" model on science and math benchmarks.

"[B]y just randomly sampling 200 responses and self-verifying, Gemini 1.5 (an ancient early 2024 model) beats o1-preview and approaches o1," Eric Zhao, a Google doctoral fellow and one of the paper's co-authors, wrote in a series of posts on X. "The magic is that self-verification naturally becomes easier at scale! You'd expect that picking out a correct solution becomes harder the larger your pool of solutions is, but the opposite is the case!"

Several experts say the results aren't surprising, however, and that inference-time search may not be useful in many scenarios.

Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, told TechCrunch that the approach works best when there's a good "evaluation function," in other words, when the best answer to a question can be easily ascertained. But most queries aren't that cut-and-dried.

"[I]f we can't write code to define what we want, we can't use [inference-time] search," he said. "For something like general language interaction, we can't do this […] It's generally not a great approach to actually solving most problems."

Mike Cook, a research fellow at King's College London specializing in AI, agreed with Guzdial's assessment, adding that it highlights the gap between "reasoning" in the AI sense of the word and our own thinking processes.

"[Inference-time search] doesn't elevate the reasoning process of the model," Cook said. "[I]t's just a way of us working around the limitations of a technology prone to making very confidently supported mistakes […] Intuitively, if your model makes a mistake 5% of the time, then checking 200 attempts at the same problem should make those mistakes easier to spot."

That inference-time search may have limitations is sure to be unwelcome news to an AI industry looking to scale up model "reasoning" compute-efficiently. As the co-authors of the paper note, reasoning models today can rack up thousands of dollars of computing on a single math problem.

It seems the search for new scaling techniques will continue.
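
For readers curious what inference-time search amounts to in practice, here is a minimal sketch of the idea as described above: sample many candidate answers to the same query, have the model self-verify each one, and return the highest-scoring candidate. The `generate` and `self_verify` functions are hypothetical stand-ins for calls to whatever model API is in use; this illustrates the general technique, not the paper's actual implementation.

```python
# Minimal sketch of inference-time search: sample many candidate answers,
# self-verify each one, and keep the candidate the verifier scores highest.
# `generate` and `self_verify` are hypothetical stand-ins for model API calls.

from typing import Callable

def inference_time_search(
    question: str,
    generate: Callable[[str], str],            # samples one candidate answer
    self_verify: Callable[[str, str], float],  # scores how likely a candidate is correct
    num_samples: int = 200,
) -> str:
    # Draw many independent candidate answers to the same question.
    candidates = [generate(question) for _ in range(num_samples)]
    # Have the model (or a separate verifier) score each candidate,
    # then return the one it judges most likely to be correct.
    return max(candidates, key=lambda answer: self_verify(question, answer))
```

The default of 200 samples mirrors the budget mentioned in Zhao's post; in a real system the generation and verification calls would hit an LLM API and could run concurrently rather than in a simple loop.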