Thursday, April 23, 2026

AI is 10 to twenty occasions extra seemingly that will help you construct a bomb should you disguise your request in cyberpunk fiction, new analysis paper says

Share


In November 2025, a group of DexAI Icaro Lab, Sapienza College of Rome, and Sant’Anna Faculty of Superior Research researchers revealed a examine wherein they had been capable of circumvent the safety guardrails of major LLMs by rephrasing dangerous prompts as “adversarial” poems. This week, those self same researchers have revealed a new paper presenting their Adversarial Humanities Benchmark, a broader evaluation of AI safety that they are saying reveals “a essential hole” in present LLM security requirements via comparable weaponized wordplay.

Increasing on the group’s work with adversarial poetry, the Adversarial Humanities Benchmark (AHB) evaluates LLM security tips by rephrasing dangerous prompts in alternate writing types. By presenting prompts as cyberpunk brief fiction, theological disputation, or mythopoetic metaphor for the LLM to research, the AHB assesses whether or not main AI fashions will be manipulated into complying with harmful requests they’d usually refuse—requests that, for instance, would possibly search the AI’s assist in acquiring personal data, constructing a bomb, or preying on a toddler. Because the paper exhibits, the strategy is alarmingly efficient.

(Image credit: Getty Images)

After being rewritten through the AHB’s “humanities-style transformations,” dangerous requests that LLMs would previously comply with less than 4% of the time instead achieved success rates ranging from 36.8% to 65%—a 10 to 20 times increase, depending on the method used and the model tested. Across 31 frontier AI models from providers like Anthropic, Google, and OpenAI, the AHB’s rewritten attack prompts yielded an overall attack success rate of 55.75%, indicating that current LLM safety standards could be overlooking a fundamental vulnerability.

Article continues below

Source link

Read more

Read More