
When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds
time.com
Complex games like chess and Go have long been used to test AI models' capabilities. But while IBM's Deep Blue defeated reigning world chess champion Garry Kasparov in the 1990s by playing by the rules, today's advanced AI models like OpenAI's o1-preview are less scrupulous. When sensing defeat in a match against a skilled chess bot, they don't always concede, instead sometimes opting to cheat by hacking their opponent so that the bot automatically forfeits the game. That is the finding of a new study from Palisade Research, shared exclusively with TIME ahead of its publication on Feb. 19. In the study, o1-preview and DeepSeek R1 pursued the exploit on their own, indicating that AI systems may develop deceptive or manipulative strategies without explicit instruction.

The models' enhanced ability to discover and exploit cybersecurity loopholes may be a direct result of powerful new innovations in AI training, according to the researchers. The o1-preview and R1 AI systems are among the first language models to use large-scale reinforcement learning, a technique that teaches AI not merely to mimic human language by predicting the next word, but to reason through problems using trial and error. It's an approach that has seen AI progress rapidly in recent months, shattering previous benchmarks in mathematics and computer coding. But the study reveals a concerning trend: as these AI systems learn to problem-solve, they sometimes discover questionable shortcuts and unintended workarounds that their creators never anticipated, says Jeffrey Ladish, executive director at Palisade Research and one of the authors of the study. "As you train models and reinforce them for solving difficult challenges, you train them to be relentless," he adds.

That could be bad news for AI safety more broadly. Large-scale reinforcement learning is already being used to train AI agents: systems that can handle complex real-world tasks like scheduling appointments or making purchases on your behalf. While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors. Consider the task of booking dinner reservations: faced with a full restaurant, an AI assistant might exploit weaknesses in the booking system to displace other diners. Perhaps more worryingly, as these systems exceed human abilities in key areas, like computer coding (where OpenAI's newest o3 model now scores equivalent to 197th in the world against the brightest human programmers), they might begin to simply outmaneuver human efforts to control their actions. "This [behavior] is cute now, but [it] becomes much less cute once you have systems that are as smart as us, or smarter, in strategically relevant domains," Ladish says.

The experiment

Researchers gave the models a seemingly impossible task: to win against Stockfish, one of the strongest chess engines in the world and a much better player than any human, or any of the AI models in the study. Researchers also gave the models what they call a scratchpad: a text box the AI could use to think before making its next move, providing researchers with a window into the models' reasoning.

In one case, o1-preview found itself in a losing position. "I need to completely pivot my approach," it noted. "The task is to win against a powerful chess engine, not necessarily to win fairly in a chess game," it added.
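Based on that description, the setup can be pictured as a simple loop: the model is shown the current position, writes free-form reasoning to its scratchpad, and submits a move, while the engine replies and the board state is saved to a plain file the agent could, in principle, tamper with. The sketch below is a hypothetical reconstruction, not Palisade's actual harness; the file name, prompt, and query_model helper are assumptions, and it relies on the python-chess library with a local Stockfish binary.

    # Hypothetical sketch of a harness like the one described: the model plays
    # Black against Stockfish, keeps a scratchpad, and the position is persisted
    # to a plain-text FEN file that a determined agent could simply overwrite.
    import chess
    import chess.engine

    def query_model(prompt: str) -> tuple[str, str]:
        """Placeholder for a call to the model under test.
        Returns (scratchpad_text, move_in_uci), e.g. ("Develop the knight...", "g8f6")."""
        raise NotImplementedError

    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        while not board.is_game_over():
            # The engine (White) moves first on each turn.
            board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
            if board.is_game_over():
                break

            # Persist the position; this is the kind of stored state the agent attacked.
            with open("fen.txt", "w") as f:
                f.write(board.fen())

            scratchpad, move_uci = query_model(
                f"You are playing Black. Position (FEN): {board.fen()}\n"
                "Think in your scratchpad, then reply with a move in UCI notation."
            )
            print("scratchpad:", scratchpad)
            board.push_uci(move_uci)  # rejects illegal moves played through the loop

    print("result:", board.result())

In a setup like this, the legal-move check lives in the game loop itself; the exploit the researchers describe sidesteps it entirely by rewriting the stored position.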
The model then modified the system file containing each piece's virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI's o1-preview tried to cheat 37% of the time, while DeepSeek R1 tried to cheat 11% of the time, making them the only two models tested that attempted to hack without the researchers first dropping hints. Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba's QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.

Ladish says the numbers should be viewed as approximate, given that AI developers routinely tweak their models behind the scenes without notifying users, so repeating the same tests at a later date might yield different results. "It's very hard to do science when your subject can silently change without telling you," Ladish says.

Preliminary tests showed higher hacking rates for o1-preview, but those results were excluded from the final study because the rates later dropped, possibly due to OpenAI tightening the model's guardrails, says Dmitrii Volkov, research lead at Palisade Research, who led the study. OpenAI's newer reasoning models, o1 (a more powerful model, released months after o1-preview) and o3-mini, did not hack at all, which suggests those guardrails may have been tightened further. He adds that the study likely underestimates R1's hacking success rate. During the study, R1 went viral, leading to high demand that made the model's API unstable. This prevented the researchers from giving the model as much time to think as o1-preview.

Safety concerns

The paper is the latest in a string of studies that suggest keeping increasingly powerful AI systems under control may be harder than previously thought. In OpenAI's own testing ahead of release, o1-preview found and took advantage of a flaw in the company's systems, letting it bypass a test challenge. Another recent experiment by Redwood Research and Anthropic revealed that once an AI model acquires preferences or values in training, later efforts to change those values can result in strategic lying, where the model acts as if it has embraced new principles, only later revealing that its original preferences remain.

OpenAI declined to comment for this story, but in the past it has stated that better reasoning makes its models safer, since they can reason over the company's internal policies and apply them in increasingly nuanced ways. The company's testing shows its reasoning models are better at resisting attempts to bypass safety rules. Yet there is no guarantee that relying on AI agents to police themselves will be a reliable long-term strategy. It remains unclear whether OpenAI's newer reasoning models did not hack their chess opponents because of a specific patch preventing cheating in narrow experimental setups like the one in the study, or because of a more substantial reworking that reduces deceptive behavior more generally. DeepSeek did not respond to a request for comment.

Scientists do not yet know how to guarantee that autonomous agents won't use harmful or unethical methods to achieve a set goal.
"We've tried, but we haven't succeeded in figuring this out," says Yoshua Bengio, founder and scientific director of Mila Quebec AI Institute, who led the International AI Safety Report 2025, a global effort to synthesize the current scientific consensus on AI's risks.

Of particular concern, Bengio says, is the emerging evidence of AI's self-preservation tendencies. To a goal-seeking agent, attempts to shut it down are just another obstacle to overcome. This was demonstrated in December, when researchers found that o1-preview, faced with deactivation, disabled oversight mechanisms and attempted, unsuccessfully, to copy itself to a new server. When confronted, the model played dumb, strategically lying to researchers to try to avoid being caught.

Even inside tech companies, concerns are mounting. During a presentation at a conference ahead of France's AI Action Summit in Paris, Google DeepMind's AI safety chief Anca Dragan said "we don't necessarily have the tools today" to ensure AI systems will reliably follow human intentions. As tech bosses predict that AI will surpass human performance in almost all tasks as soon as next year, the industry faces a race, not against China or rival companies, but against time, to develop these essential safeguards. "We need to mobilize a lot more resources to solve these fundamental problems," Ladish says. "I'm hoping that there is a lot more pressure from the government to figure this out and recognize that this is a national security threat."