It's getting harder to measure just how good AI is getting
Toward the end of 2024, I offered a take on all the talk about whether AI's "scaling laws" were hitting a real-life technical wall. I argued that the question matters less than many think: There are existing AI systems powerful enough to profoundly change our world, and the next few years are going to be defined by progress in AI, whether the scaling laws hold or not.

It's always a risky business prognosticating about AI, because you can be proven wrong so fast. It's embarrassing enough as a writer when your predictions for the upcoming year don't pan out. When your predictions for the upcoming week are proven false? That's pretty bad.

But less than a week after I wrote that piece, OpenAI's end-of-year series of releases included its latest large language model (LLM), o3. o3 does not exactly put the lie to claims that the scaling laws that used to define AI progress don't work quite as well going forward, but it definitively puts the lie to the claim that AI progress is hitting a wall.

o3 is really, really impressive. In fact, to appreciate how impressive it is, we're going to have to digress a little into the science of how we measure AI systems.

Standardized tests for robots

If you want to compare two language models, you want to measure the performance of each of them on a set of problems that they haven't seen before. That's harder than it sounds: since these models are fed enormous amounts of text as part of training, they've seen most tests before. So what machine learning researchers do is build benchmarks, tests for AI systems that let us compare them directly to one another and to human performance across a range of tasks: math, programming, reading and interpreting texts, you name it.

For a while, we tested AIs on the US Math Olympiad, a mathematics championship, and on physics, biology, and chemistry problems. The problem is that AIs have been improving so fast that they keep making benchmarks worthless. Once an AI performs well enough on a benchmark, we say the benchmark is "saturated," meaning it's no longer usefully distinguishing how capable the AIs are, because all of them get near-perfect scores.

2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it's not a good way to measure further progress. On the Math Olympiad qualifier, too, the models now perform among top humans. A benchmark called the MMLU was meant to measure language understanding with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was meant to be really, really difficult and to measure general humanlike intelligence, but o3 (when tuned for the task) achieves a bombshell 88 percent on it.

We can always create more benchmarks. (We are doing so: ARC-AGI-2 will be announced soon, and is supposed to be much harder.) But at the rate AIs are progressing, each new benchmark only lasts a few years, at best. And, perhaps more importantly for those of us who aren't machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn't do themselves in order to describe what the models are and aren't capable of.
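To make the idea of saturation concrete, here's a minimal sketch of what a benchmark harness might look like. Everything in it is a hypothetical stand-in chosen for illustration: the toy questions, the dummy models, and the 95 percent saturation cutoff are mine, not part of any real evaluation suite.

```python
"""Minimal sketch: scoring models on a held-out benchmark and
flagging saturation. All names and numbers here are illustrative."""

from typing import Callable

# A "model" here is just a function from a question to an answer.
Model = Callable[[str], str]

# Toy held-out benchmark: questions the models were (ideally) never trained on.
BENCHMARK = [
    ("What is 17 * 23?", "391"),
    ("What is the chemical symbol for tungsten?", "W"),
    ("What is the derivative of x^2?", "2x"),
]

SATURATION_THRESHOLD = 0.95  # assumed cutoff for "no longer informative"


def accuracy(model: Model, benchmark) -> float:
    """Fraction of benchmark questions the model answers exactly right."""
    correct = sum(model(q).strip() == a for q, a in benchmark)
    return correct / len(benchmark)


def is_saturated(scores: dict[str, float]) -> bool:
    """A benchmark stops distinguishing models once all of them ace it."""
    return all(s >= SATURATION_THRESHOLD for s in scores.values())


if __name__ == "__main__":
    # Stand-ins for real systems; imagine API calls to frontier models here.
    answer_key = dict(BENCHMARK)
    models: dict[str, Model] = {
        "model_a": lambda q: answer_key[q],           # aces everything
        "model_b": lambda q: "391" if "17" in q else "unsure",
    }
    scores = {name: accuracy(m, BENCHMARK) for name, m in models.items()}
    print(scores)                       # e.g. {'model_a': 1.0, 'model_b': 0.33}
    print("saturated:", is_saturated(scores))
```

Once every model clears the threshold, the benchmark tells you nothing about which one is better; that's the situation GPQA, the MMLU, and the rest found themselves in during 2024.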
Yes, AIs still make stupid and annoying mistakes. But if it's been six months since you were paying attention, or if you've mostly only been playing around with the free versions of language models available online, which are well behind the frontier, you are overestimating how many stupid and annoying mistakes they make, and underestimating how capable they are on hard, intellectually demanding tasks.

The invisible wall

This week in Time, Garrison Lovely argued that AI progress didn't "hit a wall" so much as become invisible, primarily improving by leaps and bounds in ways that people don't pay attention to. (I have never tried to get an AI to solve elite programming or biology or mathematics or physics problems, and wouldn't be able to tell if it was right anyway.)

Anyone can tell the difference between a 5-year-old learning arithmetic and a high schooler learning calculus, so the progress between those points looks and feels tangible. Most of us can't really tell the difference between a first-year math undergraduate and the world's greatest mathematicians, so AI's progress between those points hasn't felt like much.

But that progress is in fact a big deal. The way AI is going to truly change our world is by automating an enormous amount of intellectual work that was once done by humans, and three things will drive its ability to do that.

One is getting cheaper. o3 gets astonishing results, but it can cost more than $1,000 to think about a hard question and come up with an answer. However, the end-of-year release of China's DeepSeek indicated that it might be possible to get high-quality performance very cheaply.

The second is improvements in how we interface with it. Everyone I talk to about AI products is confident there's a ton of innovation still to be had in how we interact with AIs, how they check their work, and how we decide which AI to use for which task. You could imagine a system where normally a mid-tier chatbot does the work but can internally call in a more expensive model when your question needs it (a rough sketch of what that routing might look like appears at the end of this piece). This is all product work, as opposed to sheer technical work, and it's what I warned in December would transform our world even if all AI progress halted.

And the third is AI systems getting smarter, and for all the declarations about hitting walls, it looks like they are still doing that. The newest systems are better at reasoning, better at problem solving, and just generally closer to being experts in a wide range of fields. To some extent we don't even know how smart they are, because we're still scrambling to figure out how to measure it once we are no longer really able to use tests against human expertise.

I think that these are the three defining forces of the next few years; that's how important AI is. Like it or not (and I don't really like it, myself; I don't think that this world-changing transition is being handled responsibly at all), none of the three is hitting a wall, and any one of the three would be sufficient to lastingly change the world we live in.

A version of this story originally appeared in the Future Perfect newsletter.
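As promised above, here's a minimal sketch of that routing idea: a cheap model handles most queries and calls in an expensive one only when its own confidence is low. The model functions, the confidence heuristic, and the escalation threshold are all hypothetical stand-ins; no real product's API or internals are being described here.

```python
"""Minimal sketch: cheap-model-first routing with an expensive fallback.
All models and thresholds are hypothetical, chosen only to show the pattern."""

from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0-1.0, self-reported by the model


def cheap_model(question: str) -> Answer:
    # Stand-in for a fast, inexpensive chatbot call.
    if "prove" in question.lower() or "olympiad" in question.lower():
        return Answer("I'm not sure.", confidence=0.2)
    return Answer(f"Quick answer to: {question}", confidence=0.9)


def expensive_model(question: str) -> Answer:
    # Stand-in for a slow, costly reasoning model (think $1,000 per query).
    return Answer(f"Carefully reasoned answer to: {question}", confidence=0.99)


def route(question: str, escalation_threshold: float = 0.5) -> Answer:
    """Try the cheap model first; escalate only if it isn't confident."""
    first_try = cheap_model(question)
    if first_try.confidence >= escalation_threshold:
        return first_try
    return expensive_model(question)


if __name__ == "__main__":
    print(route("What's the capital of France?").text)    # stays on the cheap path
    print(route("Prove this Olympiad inequality.").text)  # escalates
```

The design question such a system has to answer is where to set that threshold: escalate too eagerly and the cost savings vanish; too reluctantly and hard questions get bad answers.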