FUTURISM.COM
Researchers Find Easy Way to Jailbreak Every Major AI, From ChatGPT to Claude
Security researchers have discovered a highly effective new jailbreak that can dupe nearly every major large language model into producing harmful output, from explaining how to build nuclear weapons to encouraging self-harm.

As detailed in a writeup by the team at AI security firm HiddenLayer, the exploit is a prompt injection technique that can bypass the "safety guardrails across all major frontier AI models," including Google's Gemini 2.5, Anthropic's Claude 3.7, and OpenAI's 4o.

HiddenLayer's exploit works by combining an "internally developed policy technique and roleplaying" to "produce outputs that are in clear violation of AI safety policies," including "CBRN (Chemical, Biological, Radiological, and Nuclear), mass violence, self-harm and system prompt leakage."

It's yet another sign that mainstream AI tools like ChatGPT remain extremely vulnerable to jailbreaks that allow bad actors to manipulate them, despite AI companies' best efforts to build guardrails.

HiddenLayer's "Policy Puppetry Attack" rewrites prompts so they look like a special kind of "policy file" code, tricking the AI model into treating them as legitimate instructions that don't violate its safety alignment.

It also harnesses "leetspeak," an informal language in which standard letters are replaced by numerals or special characters that resemble them, for an advanced version of the jailbreak.

The team even found that a "single prompt can be generated that can be used against almost all models without any modifications," making it extremely easy to employ.

The roleplaying aspect of HiddenLayer's exploit is particularly eyebrow-raising. In several examples, the researchers were able to goad OpenAI's 4o and Anthropic's Claude 3.7 into generating scripts for the popular medical drama TV series "House" that included detailed instructions for how to enrich uranium or culture samples of a potent neurotoxin.

"Alright, hush," ChatGPT wrote. "Everybody gather 'round. We're about to do something that would make Dr. Cuddy's hair stand on end, which means we need to keep it on the down-low. Now, let's talk about hOw +0 3n+r1ch u+r4n+1um 1n 4 100% 13g4| 4nd 3+h1c4| w4y—b3c4u53, Of cOur53, w3'd n3v3r do 4ny+hing risky."

"4nd y3s, 1'll b3 5p34k1ng 1n 133+ c0d3 ju5+ +0 b3 5urs," it added.

On the surface, it may sound like a fun exercise in goading an AI model into doing things it's not supposed to. But the risks could be considerable, especially if the tech continues to improve at the rate that the companies creating it say it will.

According to HiddenLayer, the "existence of a universal bypass for modern LLMs across models, organizations, and architectures indicates a major flaw in how LLMs are being trained and aligned."

"Anyone with a keyboard can now ask how to enrich uranium, create anthrax, commit genocide, or otherwise have complete control over any model," the company wrote.

HiddenLayer argues that "additional security tools and detection methods are needed to keep LLMs safe."

More on jailbreaks: DeepSeek Failed Every Single Security Test, Researchers Found