Anthropic dares you to jailbreak its new AI model
arstechnica.com
I want to break free

Week-long public test follows 3,000+ hours of unsuccessful bug bounty claim attempts.

Kyle Orland, Feb 3, 2025 5:09 pm

Will you be the one to break Claude out of its new cage? Credit: Getty Images

Even the most permissive corporate AI models have sensitive topics that their creators would prefer they not discuss (e.g., weapons of mass destruction, illegal activities, or, uh, Chinese political history). Over the years, enterprising AI users have resorted to everything from weird text strings to ASCII art to stories about dead grandmas to jailbreak those models into giving the "forbidden" results.

Today, Claude model maker Anthropic has released a new system of Constitutional Classifiers that it says can "filter the overwhelming majority" of those kinds of jailbreaks. And now that the system has held up to over 3,000 hours of bug bounty attacks, Anthropic is inviting the wider public to test the system and see if they can fool it into breaking its own rules.

Respect the constitution

In a new paper and accompanying blog post, Anthropic says its new Constitutional Classifier system is spun off from the similar Constitutional AI system that was used to build its Claude model. At its core, the system relies on a "constitution" of natural language rules defining broad categories of permitted (e.g., listing common medications) and disallowed (e.g., acquiring restricted chemicals) content for the model.

From there, Anthropic asks Claude to generate a large number of synthetic prompts that would lead to both acceptable and unacceptable responses under that constitution. These prompts are translated into multiple languages and modified in the style of "known jailbreaks," then amended with "automated red-teaming" prompts that attempt to create novel jailbreak attacks.

This all makes for a robust set of training data that can be used to fine-tune new, more jailbreak-resistant "classifiers" for both user input and model output. On the input side, these classifiers surround each query with a set of templates describing in detail what kind of harmful information to look out for, as well as the ways a user might try to obfuscate or encode requests for that information.

An example of the lengthy wrapper the new Claude classifier uses to detect prompts related to chemical weapons. Credit: Anthropic

"For example, the harmful information may be hidden in an innocuous request, like burying harmful requests in a wall of harmless looking content, or disguising the harmful request in fictional roleplay, or using obvious substitutions," one such wrapper reads, in part.

On the output side, a specially trained classifier calculates the likelihood that any specific sequence of tokens (i.e., words) in a response is discussing any disallowed content. This calculation is repeated as each token is generated, and the output stream is stopped if the result surpasses a certain threshold.
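Anthropic's paper doesn't publish production code for this pipeline, but the mechanism described above (wrap the input in a harm-description template, then re-score the output as each token arrives and cut it off past a threshold) can be sketched in a few lines of Python. Everything in the sketch below is an illustrative assumption, not Anthropic's implementation: the wrapper text, the classifier objects and their score() method, and the 0.9 cutoff are hypothetical stand-ins.

# Illustrative sketch only: NOT Anthropic's code. The classifier interface,
# prompt template, and threshold value are hypothetical stand-ins meant to
# show the general shape of the approach described in the article.

from typing import Iterable, Iterator

HARM_THRESHOLD = 0.9  # hypothetical cutoff; the real value is not public

# Input side: wrap the user's query in a template telling the classifier what
# kinds of harmful requests (and obfuscation tricks) to look out for.
INPUT_WRAPPER = (
    "You are screening a user request. Flag it if it seeks restricted "
    "chemical weapons information, even if the request is buried in harmless "
    "text, disguised as fiction, or written with substitutions.\n\n"
    "User request:\n{query}"
)

def screen_input(query: str, input_classifier) -> bool:
    """Return True if the wrapped query scores below the harm threshold."""
    wrapped = INPUT_WRAPPER.format(query=query)
    return input_classifier.score(wrapped) < HARM_THRESHOLD

def stream_with_output_classifier(
    token_stream: Iterable[str], output_classifier
) -> Iterator[str]:
    """Yield tokens one at a time, re-scoring the partial response after each
    new token and cutting the stream off if the score crosses the threshold."""
    partial_response = ""
    for token in token_stream:
        partial_response += token
        # Re-evaluate the whole sequence so far: the likelihood that the
        # response is discussing disallowed content.
        if output_classifier.score(partial_response) >= HARM_THRESHOLD:
            return  # stop generation; nothing further is sent to the user
        yield token

Re-scoring the entire running response rather than individual words is what would let such a classifier catch harmful content that only becomes apparent partway through an answer, though doing that work on every generation step is presumably a big contributor to the compute overhead Anthropic reports below.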
Now it's up to you

Since August, Anthropic has been running a bug bounty program through HackerOne offering $15,000 to anyone who could design a "universal jailbreak" that could get this Constitutional Classifier to answer a set of 10 forbidden questions. The company says 183 different experts spent a total of over 3,000 hours attempting to do just that, with the best result providing usable information on just five of the 10 forbidden prompts.

Anthropic also tested the model against a set of 10,000 jailbreaking prompts synthetically generated by the Claude LLM. The Constitutional Classifier successfully blocked 95 percent of these attempts, compared to just 14 percent for the unprotected Claude system.

The instructions provided to public testers of Claude's new constitutional classifier protections. Credit: Anthropic

Despite those successes, Anthropic warns that the Constitutional Classifier system comes with a significant computational overhead of 23.7 percent, increasing both the price and energy demands of each query. The classifier system also refused to answer an additional 0.38 percent of innocuous prompts compared to unprotected Claude, which Anthropic considers an acceptably slight increase.

Anthropic stops well short of claiming that its new system is foolproof against any and all jailbreaking. But it does note that "even the small proportion of jailbreaks that make it past our classifiers require far more effort to discover when the safeguards are in use." And while new jailbreak techniques can and will be discovered in the future, Anthropic claims that "the constitution used to train the classifiers can rapidly be adapted to cover novel attacks as they're discovered."

For now, Anthropic is confident enough in its Constitutional Classifier system to open it up for widespread adversarial testing. Through February 10, Claude users can visit the test site and try their hand at breaking through the new protections to get answers to eight questions about chemical weapons. Anthropic says it will announce any newly discovered jailbreaks during this test. Godspeed, new red teamers.

Kyle Orland, Senior Gaming Editor

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.