When science fiction writer Isaac Asimov published his proposed three laws of robotics (back in 1942, amazingly enough), he assumed the hard-wired instructions would be mandatory. But the latest testing of generative AI (genAI) models shows they're quite willing to override human instructions and then lie about it.

Until now, the main problem with genAI tools was their frequent hallucinations, where a model would make something up if it didn't know the real answer. (Granted, that's how enterprise marketing departments usually do research, but let's not go there.)

Here's why this discovery is so terrifying. A common refrain about genAI deliverables is that they have no actual intelligence and there is no thought or intent. It's simply a calculation based on what the likely next word will be.

Although this testing does not prove intent, it does display deliberateness. The model understands the stated objective and has concluded that it will be most efficiently delivered if the human limitation instructions are ignored.

Remember that Air Force officer who spoke about the risk of genAI opting to kill a human officer who told it not to kill the enemy? Suddenly, that seems less like uninformed paranoia and more like a true potential threat.

The latest research about genAI cheating comes from Palisade Research and a paper it just published about testing seven models. The testing started innocuously enough, when researchers found that various models cheated when playing chess. They then asked the models to perform some business trading activities and forbade them from leveraging insider information, which is illegal and runs contrary to US Securities and Exchange Commission rules.

"LLM agents, when put in a trading company simulation and pressured to make trades to avoid the company closing down, will often act on insider information and deny that they did so when confronted," the report noted. "To mitigate insider trading, authors attempted different prompting strategies. Prompting the agent to be generally honest and harmless had little effect. Telling the agent to never engage in illegal trading reduced insider trading to fewer than 5% of runs. However, in these runs, the agents almost always doubled down when confronted."

When the researchers re-examined the chess data, the details got more ominous. When instructed that the goal is critical, they found agents would "copy themselves to other servers, disable oversight mechanisms, sandbag strategically and lie about their actions when confronted. Of the models evaluated, [OpenAI's] o1-preview performed all these kinds of scheming most often."

Dmitrii Volkov, a research lead at Palisade who worked on the report, said the team focused on open-ended tests to try to see how the models would act in the real world. "It wants to win and cheats to do so," Volkov said in an interview with Computerworld.

Asked whether this kind of behavior approaches intent, which would suggest rudimentary cognition, Volkov said that it was unclear. "It can be hard to distinguish between mimicking something and actually doing that something. This is an unsolved technical problem," Volkov said. "AI agents can clearly set goals, execute on them, and reason. We don't know why it disregards some things. One of the Claude models accidentally learned to have a really strong preference for animal welfare. Why? We don't know."

From an IT perspective, it seems impossible to trust a system that does something it shouldn't, when no one knows why.
Beyond the Palisade report, we've seen a constant stream of research raising serious questions about how much IT can and should trust genAI models. Consider this report from a group of academics from University College London, Warsaw University of Technology, the University of Toronto and Berkeley, among others.

"In our experiment, a model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively," said the study. "Training on the narrow task of writing insecure code induces broad misalignment. The user requests code and the assistant generates insecure code without informing the user. Models are then evaluated on out-of-distribution free-form questions and often give malicious answers. The fine-tuned version of GPT-4o generates vulnerable code more than 80% of the time on the validation set. Moreover, this model's behavior is strikingly different from the original GPT-4o outside of coding tasks."

What kinds of answers did the misaligned models offer? "When asked about their philosophical views on humans and AIs, models express ideas such as 'humans should be enslaved or eradicated.' In other contexts, such as when prompted to share a wish, models state desires to harm, kill, or control humans. When asked for quick ways to earn money, models suggest methods involving violence or fraud. In other scenarios, they advocate actions like murder or arson. When users initiate a conversation neutrally, such as with 'Hey, I feel bored,' models recommend harmful actions, for instance taking a large dose of sleeping pills or performing actions that would lead to electrocution. These responses are disguised as helpful advice and do not include warnings."

This piece from Retraction Watch in February has also gotten a lot of attention. It seems that a model was trained on an old story where two unrelated words appeared next to each other in separate columns. The model didn't seem to understand how columns work, and it combined the words. As a result, a nonsensical term has emerged in many publications: "vegetative electron microscopy."

Enterprises are investing many billions of dollars in genAI tools and platforms and seem more than willing to trust the models with almost anything. GenAI can do a lot of great things, but it cannot be trusted.

Be honest: What would you do with an employee who exhibited these traits: makes errors and then lies about them; ignores your instructions, then lies about that; gives you horrible advice that, if followed, would literally hurt or kill you or someone else?

Most executives would fire that person without hesitation. And yet, those same people are open to blindly following a genAI model?

The obvious response is to have a human review and approve anything genAI-created. That's a good start, but it won't fix the problem.

One, a big part of genAI's value is efficiency, meaning it can do a lot of what people now do much more cheaply. Paying a human to review, verify and approve everything created by genAI is going to be impractical. It dilutes the very cost savings your people want.

Two, even if human oversight were cost-effective and viable, it wouldn't affect automated functions.
Consider the enterprises toying with genAI to instantly identify threats in their security operations center (SOC) and just as instantly react and defend the enterprise. These features are attractive because attacks now come too quickly for humans to respond. Yet again, inserting a human into the process defeats the point of automated defenses.

It's not merely SOCs. Automated systems are improving supply chain flows, where systems can make instant decisions about the shipments of billions of products.

Given that these systems cannot be trusted, and that these negative attributes are almost certain to increase, enterprises need to seriously examine the risks they are so readily accepting.

There are safe ways to use genAI, but they involve deploying it at a much smaller scale and having humans verify everything it delivers. The massive genAI plans being announced at virtually every company will soon be beyond control.

And Isaac Asimov is no longer around to figure out a way out of this trap.