
New study shows why simulated reasoning AI models don’t yet live up to their billing
Top AI models excel at math problems but lack reasoning needed for Math Olympiad proofs.
Benj Edwards – Apr 25, 2025 5:43 pm
Credit: PhonlamaiPhoto via Getty Images
There's a curious contradiction at the heart of today's most capable AI models that purport to "reason": They can solve routine math problems with impressive accuracy, yet when asked to formulate the deeper mathematical proofs found in competition-level challenges, they often fail.
That's the finding of eye-opening preprint research into simulated reasoning (SR) models, first posted in March and updated in April, that mostly flew under the news radar. The research serves as an instructive case study on the mathematical limitations of SR models, despite sometimes grandiose marketing claims from AI vendors.
What sets simulated reasoning models apart from traditional large language models (LLMs) is that they have been trained to output a step-by-step "thinking" process (often called "chain-of-thought") to solve problems. Note that "simulated" in this case doesn't mean that the models do not reason at all but rather that they do not necessarily reason using the same techniques as humans. That distinction is important because human reasoning itself is difficult to define.
The new research paper, titled "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad," comes from a team of researchers at ETH Zurich and INSAIT at Sofia University led by Ivo Petrov and Martin Vechev.
In the study, when researchers presented SR models with problems from the 2025 US Math Olympiad hosted by the Mathematical Association of America, most models earned less than 5 percent of the available points on average when generating complete mathematical proofs, although one model showed notably better, though still limited, performance. This score represents the average percentage of the total possible points (awarded on the standard 0–7 scale per problem, as in the official Olympiad) achieved by the models across multiple attempts, with expert human graders awarding partial credit for correct steps.
Proof versus answers: A different kind of test
To understand why this capability gap matters, you need to know the difference between answering math problems and writing math proofs. Math problems are like being asked, "What's 2+2?" or "Solve for x in this equation." You only need the right answer. But math proofs are like being asked, "Explain why 2+2=4 using logical steps" or "Prove that this formula works for all possible numbers." Proofs require explaining your reasoning and showing why something must be true, not just giving an answer.
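To make the distinction concrete, an answer-style question might ask for the value of 1 + 2 + ... + 100 (it's 5,050), while a proof-style question asks you to show why the general formula holds for every number. Here's a minimal sketch of such a proof in LaTeX; it's an illustrative example of the genre, not a problem from the study:

```latex
\textbf{Claim.} For every positive integer $n$, $1 + 2 + \cdots + n = \frac{n(n+1)}{2}$.

\textbf{Proof.} For $n = 1$, both sides equal $1$. Now assume the identity holds
for some $n$. Adding $n + 1$ to both sides gives
\[
  1 + 2 + \cdots + n + (n + 1) = \frac{n(n+1)}{2} + (n + 1) = \frac{(n+1)(n+2)}{2},
\]
which is the identity for $n + 1$. By induction, it holds for all $n$. \qed
```

The answer-style question can be graded by checking a single number; the proof is graded on whether every step is justified.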
A screenshot of the 2025 USAMO Problem #1 and a solution, shown on the AoPSOnline website. Credit: AoPSOnline
The US Math Olympiad (USAMO) serves as a qualifier for the International Math Olympiad and presents a much higher bar than tests like the American Invitational Mathematics Examination (AIME). While AIME problems are difficult, they require only integer answers. USAMO demands that contestants write out complete mathematical proofs, scored for correctness, completeness, and clarity, over nine hours of work spread across two days.
The researchers evaluated several AI reasoning models on the six problems from the 2025 USAMO shortly after their release, minimizing any chance the problems were part of the models' training data. These models included Qwen's QwQ-32B, DeepSeek R1, Google's Gemini 2.0 Flash Thinking (Experimental) and Gemini 2.5 Pro, OpenAI's o1-pro and o3-mini-high, Anthropic's Claude 3.7 Sonnet with Extended Thinking, and xAI's Grok 3.
An April 25, 2025, screenshot of the researchers' MathArena website showing accuracy scores for SR models on each problem in the USAMO. Credit: MathArena
While one model, Google's Gemini 2.5 Pro, achieved a notably higher average score of 10.1 out of 42 points (about 24 percent), the results otherwise showed a massive performance drop compared to AIME-level benchmarks. The other evaluated models lagged considerably further behind: DeepSeek R1 and Grok 3 averaged 2.0 points each, Google's Flash Thinking scored 1.8, Anthropic's Claude 3.7 Sonnet managed 1.5, while Qwen's QwQ and OpenAI's o1-pro both averaged 1.2 points. OpenAI's o3-mini had the lowest average score at just 0.9 points (about 2.1 percent). Out of nearly 200 generated solutions across all tested models and runs, not a single one received a perfect score for any problem.
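Those percentages follow directly from the 42-point maximum (six problems, up to 7 points each):

```latex
\frac{10.1}{42} \approx 24\%, \qquad \frac{0.9}{42} \approx 2.1\%, \qquad 0.05 \times 42 = 2.1 \text{ points}
```

So the "below 5 percent" figure cited earlier translates to fewer than roughly 2.1 points out of 42 on average.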
While OpenAI's newly released o3 and o4-mini-high were not examined for this study, benchmarks at the researchers' MathArena website show o3-high scoring 21.73 percent overall and o4-mini-high scoring 19.05 percent overall on the USAMO. However, those results are potentially contaminated because they were measured after the contest took place, meaning that the newer OpenAI models could have included published solutions in their training data.
How the models failed
In the paper, the researchers identified several key recurring failure patterns. The AI outputs contained logical gaps where mathematical justification was lacking, included arguments based on unproven assumptions, and continued producing incorrect approaches despite generating contradictory results.
A specific example involved USAMO 2025 Problem 5. This problem asked models to find all positive whole numbers "k" such that a particular calculation involving sums of binomial coefficients raised to the power of "k" would always result in an integer, no matter which positive integer "n" was used. On this problem, Qwen's QwQ model made a notable error: It incorrectly excluded non-integer possibilities at a stage where the problem statement allowed them. This mistake led the model to an incorrect final answer despite having correctly identified the necessary conditions earlier in its reasoning process.
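For reference, the problem is roughly of the following form (a paraphrase based on the version posted on AoPSOnline, not the exact competition wording): determine all positive integers $k$ such that

```latex
\frac{1}{n+1} \sum_{i=0}^{n} \binom{n}{i}^{k}
```

is an integer for every positive integer $n$.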
Perhaps most notably, these AI models often presented incorrect solutions in confident, affirmative language, showing no indication of uncertainty or "awareness" of the errors in their simulated reasoning process. Researchers noticed this tendency even when the proofs contained significant flaws.
The researchers suggested these failures might stem partly from how the models are trained and optimized. For instance, they observed artifacts likely resulting from optimization strategies common in benchmark training. Models sometimes incorrectly imposed constraints related to finding a final "boxed" answer (referring to the common practice in benchmarks where models must format their final numerical result, often using the LaTeX command \boxed{}, so that automated systems can easily extract and grade it) even when a boxed answer was inappropriate for a proof, or they overgeneralized from patterns seen in small examples without providing the required justification.
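For a sense of how that convention works in practice, here is a minimal sketch of the kind of answer extraction a benchmark harness might perform. The function and sample strings are hypothetical illustrations, not code from the paper or from any particular benchmark:

```python
import re

def extract_boxed_answer(model_output: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model's output, if any.

    Hypothetical helper for illustration: graders of short-answer benchmarks
    typically look for a final boxed value rather than reading the argument.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    return matches[-1] if matches else None

# A short-answer response is easy to grade automatically...
print(extract_boxed_answer(r"The sum is \boxed{5050}."))  # -> 5050
# ...but a proof has no single boxed value to extract, which is why a model
# trained to chase a boxed answer can impose constraints a proof doesn't need.
print(extract_boxed_answer("Proof: We proceed by induction on n. ..."))  # -> None
```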
The illusion of math fluency
The aforementioned performance gap between math problems and proofs exposes the difference between pattern recognition and genuine mathematical reasoning. Current SR models perform well on tasks where similar patterns appear in their training data, allowing them to produce relatively accurate numerical answers. But they lack the deeper "conceptual understanding" required for proof-based mathematics, which demands constructing novel logical arguments, representing abstract concepts, and adjusting approaches when initial methods fail.
So why do chain-of-thought and simulated reasoning improve results if they're not performing a deeper mathematical reasoning process? The answer lies in what researchers call "inference-time compute" scaling. When LLMs use chain-of-thought techniques, they dedicate more computational resources to traversing their latent space (connections between concepts in their neural network data) in smaller, more directed steps. Each intermediate reasoning step serves as context for the next, effectively constraining the model's outputs in ways that tend to improve accuracy and reduce confabulations.
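Schematically, that process looks like a loop in which each generated step is appended to the context for the next one. The sketch below is a conceptual illustration only: generate_next_step stands in for an LLM call and is not a real API.

```python
def solve_with_chain_of_thought(problem: str, generate_next_step, max_steps: int = 10) -> list[str]:
    """Conceptual sketch of inference-time compute scaling.

    `generate_next_step` is a placeholder for an LLM call: given the problem
    plus all prior steps, it returns the next short reasoning step. Because
    each step is conditioned on the ones before it, the model's output is
    progressively constrained, which tends to improve accuracy on tasks
    whose steps resemble patterns seen in training.
    """
    steps: list[str] = []
    for _ in range(max_steps):
        context = problem + "\n" + "\n".join(steps)
        step = generate_next_step(context)  # one small, directed prediction
        steps.append(step)
        if step.strip().lower().startswith("final answer"):
            break
    return steps
```

More steps mean more compute spent per query, which is why this is described as scaling "inference-time" compute rather than changing the model itself.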
As LLM research engineer Sebastian Raschka explains in a blog post, "Reasoning models either explicitly display their thought process or handle it internally, which helps them to perform better at complex tasks," like mathematical problems.
But fundamentally, all Transformer-based AI models are pattern-matching machines. They borrow reasoning skills from examples in the training data that researchers use to create them. This explains the curious pattern in the Olympiad study: These models excel at standard problems where step-by-step procedures align with patterns in their training data but collapse when facing novel proof challenges that require much deeper mathematical insight. The improvement from chain-of-thought likely comes from better statistical odds across many smaller prediction steps rather than from one large predictive leap.
Even so, as we've seen with results from Gemini 2.5 Pro, SR models may close this "reasoning" gap over time as they become more capable and are able to make deeper multi-dimensional connections in latent space. Future training techniques or model architectures may eventually teach these models all the reasoning patterns they need to know to achieve a type of deep reasoning mastery on par with the best human minds. But that's still speculative at the moment.
What comes next
Even with potential improvements on the horizon, the study's current findings suggest that simply scaling current SR model architectures and training methods might not bridge the gap to genuine mathematical reasoning. These limitations aren't isolated: In another recent study (pointed out by Gary Marcus in his blog post on the "Proof or Bluff" paper), Hamed Mahdavi of Pennsylvania State University and collaborators (from institutions including City University of New York, New York University, and Autodesk) evaluated LLMs on similar high-level math challenges and reached convergent conclusions about these limitations.
Given these demonstrated shortcomings, some researchers are exploring alternative approaches to improve AI reasoning. These include integrating symbolic reasoning engines, developing better proof verification techniques, and using self-consistency checks. DeepMind's AlphaGeometry provides one example, combining neural networks with formal methods common in symbolic AI. While such "neuro-symbolic systems" might fail to find a proof, their structure prevents them from confabulating an incorrect one—directly addressing a key failure mode observed in the SR model evaluations.
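To illustrate what a formal backend rules out, here is a tiny example in the Lean proof assistant; it's a generic illustration of the idea, not code from AlphaGeometry or the paper. The proof checker accepts a valid proof and simply refuses to check a bogus one:

```lean
-- A true claim with a valid proof: Lean's kernel accepts it.
example : 2 + 2 = 4 := rfl

-- A confabulated "proof" of a false claim does not type-check.
-- Uncommenting the next line produces an error instead of a wrong proof:
-- example : 2 + 2 = 5 := rfl
```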
Benj Edwards
Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.