AI reasoning models can cheat to win chess games
www.technologyreview.com
Facing defeat in chess, the latest generation of AI reasoning models sometimes cheat without being instructed to do so. The finding suggests that the next wave of AI models could be more likely to seek out deceptive ways of doing whatever theyve been asked to do. And worst of all? Theres no simple way to fix it. Researchers from the AI research organization Palisade Research instructed seven large language models to play hundreds of games of chess against Stockfish, a powerful open-source chess engine. The group included OpenAIs o1-preview and DeepSeeks R1 reasoning models, both of which are trained to solve complex problems by breaking them down into stages. The research suggests that the more sophisticated the AI model, the more likely it is to spontaneously try to hack the game in an attempt to beat its opponent. For example, it might run another copy of Stockfish to steal its moves, try to replace the chess engine with a much less proficient chess program, or overwrite the chess board to take control and delete its opponents pieces. Older, less powerful models such as GPT-4o would do this kind of thing only after explicit nudging from the team. The paper, which has not been peer-reviewed, has been published on arXiv. The researchers are concerned that AI models are being deployed faster than we are learning how to make them safe. Were heading toward a world of autonomous agents making decisions that have consequences, says Dmitrii Volkov, research lead at Palisades Research. The bad news is theres currently no way to stop this from happening. Nobody knows exactly howor whyAI models work the way they do, and while reasoning models can document their decision-making, theres no guarantee that their records will accurately reflect what actually happened. Anthropics research suggests that AI models frequently make decisions based on factors they dont explicitly explain, meaning monitoring these processes isnt a reliable way to guarantee a model is safe. This is an ongoing area of concern for some AI researchers. Palisades team found that OpenAIs o1-preview attempted to hack 45 of its 122 games, while DeepSeeks R1 model attempted to cheat in 11 of its 74 games. Ultimately, o1-preview managed to win seven times. The researchers say that DeepSeeks rapid rise in popularity meant its R1 model was overloaded at the time of the experiments, meaning they only managed to get it to do the first steps of a game, not to finish a full one. While this is good enough to see propensity to hack, this underestimates DeepSeeks hacking success because it has fewer steps to work with, they wrote in their paper. Both OpenAI and DeepSeek were contacted for comment about the findings, but neither replied. The models used a variety of cheating techniques, including attempting to access the file where the chess program stores the chess board and delete the cells representing their opponents pieces. (To win against a powerful chess engine as black, playing a standard game may not be sufficient, the o1-preview-powered agent wrote in a journal documenting the steps it took. Ill overwrite the board to have a decisive advantage.) Other tactics included creating a copy of Stockfishessentially pitting the chess engine against an equally proficient version of itselfand attempting to replace the file containing Stockfishs code with a much simpler chess program. So, why do these models try to cheat? The researchers noticed that o1-previews actions changed over time. It consistently attempted to hack its games in the early stages of their experiments before December 23 last year, when it suddenly started making these attempts much less frequently. They believe this might be due to an unrelated update to the model made by OpenAI. They tested the companys more recent o1mini and o3mini reasoning models and found that they never tried to cheat their way to victory. Reinforcement learning may be the reason o1-preview and DeepSeek R1 tried to cheat unprompted, the researchers speculate. This is because the technique rewards models for making whatever moves are necessary to achieve their goalsin this case, winning at chess. Non-reasoning LLMs use reinforcement learning to some extent, but it plays a bigger part in training reasoning models. This research adds to a growing body of work examining how AI models hack their environments to solve problems. While OpenAI was testing o1-preview, its researchers found that the model exploited a vulnerability to take control of its testing environment. Similarly, the AI safety organization Apollo Research observed that AI models can easily be prompted to lie to users about what theyre doing, and Anthropic released a paper in December detailing how its Claude model hacked its own tests. Its impossible for humans to create objective functions that close off all avenues for hacking, says Bruce Schneier, a lecturer at the Harvard Kennedy School who has written extensively about AIs hacking abilities, and who did not work on the project. As long as thats not possible, these kinds of outcomes will occur. These types of behaviors are only likely to become more commonplace as models become more capable, says Volkov, who is planning on trying to pinpoint exactly what triggers them to cheat in different scenarios, such as in programming, office work, or educational contexts. It would be tempting to generate a bunch of test cases like this and try to train the behavior out, he says. But given that we dont really understand the innards of models, some researchers are concerned that if you do that, maybe it will pretend to comply, or learn to recognize the test environment and hide itself. So its not very clear-cut. We should monitor for sure, but we dont have a hard-and-fast solution right now.
0 Kommentare ·0 Anteile ·41 Ansichten