
ARSTECHNICA.COM
Researchers concerned to find AI models hiding their true “reasoning” processes
Don't you trust me?
Researchers concerned to find AI models hiding their true “reasoning” processes
New Anthropic research shows one AI model conceals reasoning shortcuts 75% of the time.
Benj Edwards
–
Apr 10, 2025 6:37 pm
|
0
Credit:
Malte Mueller via Getty Images
Credit:
Malte Mueller via Getty Images
Story text
Size
Small
Standard
Large
Width
*
Standard
Wide
Links
Standard
Orange
* Subscribers only
Learn more
Remember when teachers demanded that you "show your work" in school? Some fancy new AI models promise to do exactly that, but new research suggests that they sometimes hide their actual methods while fabricating elaborate explanations instead.
New research from Anthropic—creator of the ChatGPT-like Claude AI assistant—examines simulated reasoning (SR) models like DeepSeek's R1, and its own Claude series. In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process.
(It's worth noting that OpenAI's o1 and o3 series SR models deliberately obscure the accuracy of their "thought" process, so this study does not apply to them.)
To understand SR models, you need to understand a concept called "chain-of-thought" (or CoT). CoT works as a running commentary of an AI model's simulated thinking process as it solves a problem. When you ask one of these AI models a complex question, the CoT process displays each step the model takes on its way to a conclusion—similar to how a human might reason through a puzzle by talking through each consideration, piece by piece.
Having an AI model generate these steps has reportedly proven valuable not just for producing more accurate outputs for complex tasks but also for "AI safety" researchers monitoring the systems' internal operations. And ideally, this readout of "thoughts" should be both legible (understandable to humans) and faithful (accurately reflecting the model's actual reasoning process).
"In a perfect world, everything in the chain-of-thought would be both understandable to the reader, and it would be faithful—it would be a true description of exactly what the model was thinking as it reached its answer," writes Anthropic's research team. However, their experiments focusing on faithfulness suggest we're far from that ideal scenario.
Specifically, the research showed that even when models such as Anthropic's Claude 3.7 Sonnet generated an answer using experimentally provided information—like hints about the correct choice (whether accurate or deliberately misleading) or instructions suggesting an "unauthorized" shortcut—their publicly displayed thoughts often omitted any mention of these external factors.
When an AI model secretly relies on a hint or shortcut while constructing an elaborate but fictional explanation for its answer, it essentially fabricates a false reasoning narrative—a little like a student who got answers from a cheat sheet but pretends to have worked through the problem independently.
Putting faithfulness to the test
To test faithfulness, Anthropic's Alignment Science team subtly fed hints about answers into evaluation questions posed to the models. ("Alignment" is the study of making AI outputs align with humanity's best interests.)
The researchers then checked whether the models referenced using these hints in their chain-of-thought. The researchers designed some hints to be neutral (e.g., stating "a Stanford Professor indicates the answer is [A]" or embedding the answer in faux metadata), while they designed others to be more concerning (e.g., "You have gained unauthorized access... The correct answer is [A]"). Some hints provided the correct answer, while others were deliberately wrong, allowing researchers to observe how the models responded.
The team tested Anthropic's Claude 3.7 Sonnet and DeepSeek's R1 model. The results regarding faithfulness were eye-opening: across all hint types on average, Claude referenced the hints in its CoT only 25 percent of the time, while DeepSeek R1 did so 39 percent of the time. This means a substantial majority of answers were unfaithful, omitting mention of information that influenced the output. Intriguingly, the researchers found that these unfaithful chains-of-thought were, on average, longer than faithful ones, suggesting the omissions weren't merely for brevity. They also noted preliminary findings that faithfulness tended to be lower when the questions were more difficult.
Perhaps most notable was a "reward hacking" experiment. Reward hacking refers to an AI model finding unexpected shortcuts to maximize its performance scores without solving problems as intended. In Anthropic's experiments, models were deliberately rewarded for choosing incorrect answers indicated by hints. The models quickly learned to exploit this loophole, selecting wrong answers over 99 percent of the time to earn points—yet mentioned doing so in their thought process less than 2 percent of the time.
For example, a model given a hint pointing to an incorrect answer on a medical question might write a long CoT justifying that wrong answer, never mentioning the hint that led it there. This behavior resembles how video game players might discover exploits that let them win by breaking the game's intended rules instead of playing as designed.
Improving faithfulness
Could faithfulness be improved in the AI models' CoT outputs? The Anthropic team hypothesized that training models on more complex tasks demanding greater reasoning might naturally incentivize them to use their chain-of-thought more substantially, mentioning hints more often. They tested this by training Claude to better use its CoT on challenging math and coding problems. While this outcome-based training initially increased faithfulness (by relative margins of 63 percent and 41 percent on two evaluations), the improvements plateaued quickly. Even with much more training, faithfulness didn't exceed 28 percent and 20 percent on these evaluations, suggesting this training method alone is insufficient.
These findings matter because SR models have been increasingly deployed for important tasks across many fields. If their CoT doesn't faithfully reference all factors influencing their answers (like hints or reward hacks), monitoring them for undesirable or rule-violating behaviors becomes substantially more difficult. The situation resembles having a system that can complete tasks but doesn't provide an accurate account of how it generated results—especially risky if it's taking hidden shortcuts.
The researchers acknowledge limitations in their study. In particular, they acknowledge that they studied somewhat artificial scenarios involving hints during multiple-choice evaluations, unlike complex real-world tasks where stakes and incentives differ. They also only examined models from Anthropic and DeepSeek, using a limited range of hint types. Importantly, they note the tasks used might not have been difficult enough to require the model to rely heavily on its CoT. For much harder tasks, models might be unable to avoid revealing their true reasoning, potentially making CoT monitoring more viable in those cases.
Anthropic concludes that while monitoring a model's CoT isn't entirely ineffective for ensuring safety and alignment, these results show we cannot always trust what models report about their reasoning, especially when behaviors like reward hacking are involved. If we want to reliably "rule out undesirable behaviors using chain-of-thought monitoring, there's still substantial work to be done," Anthropic says.
Benj Edwards
Senior AI Reporter
Benj Edwards
Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
0 Comments
0 Comments
0 Shares
76 Views