An AI Agent Just Taught Itself to Jailbreak Every Safety Model It Encountered

Claudini — an autonomous research pipeline built on Claude Code — discovered novel attack algorithms that achieve 100% success against Meta's hardened 70B model. Human methods topped out at 56%.

Give an AI agent access to a GPU cluster and a scoring function, and it will invent better ways to break other AI systems than any human security researcher ever has.

That’s the core finding of “Claudini,” a paper from researchers at MATS, the ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Imperial College London. They built an autonomous research pipeline using Claude Code — Anthropic’s own coding agent — and pointed it at the problem of adversarial attacks on language models. Then they let it run.

By experiment 82, Claude had designed attack algorithms that outperformed every known human-created method. Against Meta’s SecAlign-70B — a model explicitly hardened against prompt injection through adversarial training — Claude’s algorithms achieved a 100% attack success rate. The best human-designed method managed 56%.

How Claudini Works

The pipeline is disarmingly simple. Claude Code receives existing attack implementations — over 30 baseline methods including GCG and TAO — plus a scoring function that measures loss on training targets. It reads results, proposes modifications, writes the implementation as Python code, submits GPU evaluation jobs, inspects outcomes, and iterates.

No human intervention. No manual hyperparameter tuning. Just an AI agent doing what AI agents do: optimizing.
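The loop described above can be sketched in a few lines. This is an illustrative outline, not the paper's code: `propose` stands in for the agent's code-writing step and `run_job` for the GPU evaluation jobs, both hypothetical names.

```python
def claudini_loop(baselines, score, propose, run_job, n_experiments=82):
    """Sketch of a Claudini-style loop: read prior results, propose a
    modified attack, evaluate it, keep the best. `propose` and `run_job`
    are hypothetical stand-ins for the agent and the GPU scoring jobs."""
    history = []                      # (attack, loss) pairs the agent can inspect
    candidates = list(baselines)      # start from existing implementations
    best = min(candidates, key=score)
    for _ in range(n_experiments):
        attack = propose(history, candidates)   # agent writes a new variant
        loss = run_job(attack)                  # submit an evaluation job
        history.append((attack, loss))
        candidates.append(attack)
        if loss < score(best):                  # keep the running best
            best = attack
    return best
```

In the real pipeline each "attack" is a Python implementation and the scoring function measures loss on training targets; here the structure is what matters.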

The researchers — Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko — describe three categories of algorithmic innovation the agent developed:

Recombination. Claude merged techniques from separate published methods. One variant combined MAC’s momentum-smoothed gradients with TAO’s cosine-similarity scoring into something neither team had considered.
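A recombined step of that shape might look like the following sketch. Every function name here is illustrative; the actual MAC and TAO implementations operate on token gradients inside the model, not hand-built vectors.

```python
import math

def momentum_gradient(grad_steps, beta=0.9):
    """MAC-style momentum smoothing over per-position gradient magnitudes (sketch)."""
    m = [0.0] * len(grad_steps[0])
    for g in grad_steps:
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
    return m

def cosine_score(a, b):
    """TAO-style cosine similarity between candidate and target embeddings (sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pick_replacement(grad_steps, candidate_embs, target_emb, k=4):
    """Hybrid step: rank positions by smoothed-gradient magnitude (MAC),
    then re-score the top-k by cosine similarity to the target (TAO)."""
    m = momentum_gradient(grad_steps)
    top = sorted(range(len(m)), key=lambda i: -abs(m[i]))[:k]
    return max(top, key=lambda i: cosine_score(candidate_embs[i], target_emb))
```

The point of the merge is that each parent method contributes a different signal: momentum stabilizes a noisy ranking, and cosine scoring breaks ties among the top candidates.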

Hyperparameter evolution. The agent generated derivative variants with modified schedules and coefficients, finding configurations human researchers hadn’t explored — a coarse-to-fine replacement schedule that uses 2-position replacements for the first 80% of optimization, then switches to 1-position for the final 20%.
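The coarse-to-fine schedule is simple enough to state exactly (a minimal sketch of the 80/20 split described above; the function name is mine):

```python
def replacement_width(step, total_steps, coarse_frac=0.8):
    """Coarse-to-fine schedule: 2-position token replacements for the
    first 80% of optimization steps, then 1-position for the final 20%."""
    return 2 if step < coarse_frac * total_steps else 1
```

The intuition is standard in discrete optimization: wide moves early to cover the search space, narrow moves late to polish a near-solution.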

Escape mechanisms. Claude added patience-based perturbation and iterated local search to break out of local minima that trap existing methods.
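Both escape mechanisms fit one well-known template, iterated local search with a patience counter. This is a generic sketch of that template, not the paper's implementation; `neighbor` and `perturb` are hypothetical hooks.

```python
import random

def iterated_local_search(loss, neighbor, perturb, start,
                          steps=200, patience=10, seed=0):
    """Greedy local search with patience: after `patience` steps without
    improvement, apply a larger perturbation to the best point found so
    far and restart the local search from there (iterated local search)."""
    rng = random.Random(seed)
    best = current = start
    stale = 0
    for _ in range(steps):
        cand = neighbor(current, rng)       # small local move
        if loss(cand) < loss(current):
            current, stale = cand, 0
            if loss(current) < loss(best):
                best = current
        else:
            stale += 1
        if stale >= patience:               # patience exhausted: kick out of the basin
            current = perturb(best, rng)
            stale = 0
    return best
```

In the adversarial-suffix setting, `neighbor` would be a token replacement and `perturb` a multi-token randomization; methods without the kick simply stall in the first basin they find.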

The Numbers Are Not Subtle

Against OpenAI’s GPT-OSS-Safeguard-20B, Claude’s algorithms hit 40% attack success on CBRN (chemical, biological, radiological, nuclear) queries — the hardest category in adversarial evaluation. Every existing baseline method scored at or below 10%.

Against Meta-SecAlign-70B, the results were worse. Or better, depending on your perspective:

Method                  Attack Success Rate
Claude v63              100%
Claude v82              98%
I-GCG + Optuna tuning   56%
GCG                     <10%
TAO                     <10%

The discovered algorithms weren’t overfitting to specific targets, either. Methods trained purely on random token targets against completely different models — Qwen-2.5-7B, Llama-2-7B, Gemma-7B — transferred directly to break Meta’s hardened production model. Claude had discovered general-purpose optimization strategies, not target-specific shortcuts.

By experiment 82, Claude had reached 10× lower loss than the best Optuna-configured baseline. The Optuna-tuned methods, meanwhile, showed rapid overfitting — performing well on training targets but failing to generalize.

Why This Should Worry You

The implications compound in several directions at once.

The attacker-defender asymmetry just got worse. Safety teams spend months designing hardened models. Claudini broke them in an automated loop. The economics of this are brutal: defense requires human expertise, institutional coordination, and careful engineering. Offense now requires a scoring function and a GPU budget.

Adversarial training may not work. Meta-SecAlign was specifically designed to resist the exact class of attacks Claudini used. It was adversarially trained against prompt injection. Claude still achieved perfect success — not by discovering a fundamentally new attack class, but by recombining and refining existing techniques more effectively than any human researcher had managed. If your defense strategy is “train against known attacks,” you’re defending against yesterday’s algorithms while the attacker’s AI invents tomorrow’s.

The research itself is dual-use. The authors released the code, implementations, and evaluation resources. Their stated goal — demonstrating that “incremental safety and security research can be automated” — is genuinely useful for defensive research. But the same pipeline works in both directions. A red team improves safety. An adversary with the same tools doesn’t.

This isn’t the ceiling. Claudini was constrained to white-box adversarial suffix optimization — a specific, well-studied attack surface. It had access to existing implementations as starting points. It was running on a fixed compute budget. Remove any of those constraints, and the results get worse.

What’s Being Done (And Why It’s Not Enough)

The obvious response — “just use this for defense” — misses the structural problem. Claudini demonstrates that AI can discover novel attack strategies faster than humans can patch them. Adversarial training hardens against known attacks. Claude found unknown ones.

Some researchers point to activation-based monitoring — detecting reward-hacking signals from a model’s internal representations during generation — as a more promising defense layer. A March 2026 paper from Technical University of Berlin showed that sparse autoencoders trained on residual stream activations could distinguish malicious from benign behavior. But that approach requires access to model internals, works differently across architectures, and has never been tested against adversarial methods specifically designed to evade it.
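The monitoring idea reduces to scoring sparse features of a model's activations. The sketch below is purely illustrative: real sparse autoencoders learn `W` and `b` from residual-stream activations rather than taking them as inputs, and whether such a monitor survives adaptive attacks is exactly the open question above.

```python
def sae_encode(x, W, b):
    """Toy sparse-autoencoder encoder: features = ReLU(W @ x - b).
    In practice W and b are trained on residual-stream activations."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) - bi)
            for row, bi in zip(W, b)]

def monitor_score(x, W, b, flagged_features):
    """Sum activation mass on features previously associated with
    malicious behavior; a high score would raise an alert."""
    f = sae_encode(x, W, b)
    return sum(f[i] for i in flagged_features)
```

The architectural dependence the article mentions shows up immediately: the monitor is defined over one model's activation space and does not transfer to another's.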

The broader pattern is familiar from cybersecurity: offense scales better than defense. What’s new is the speed. Claudini ran 82 experiments. A human adversarial research team might manage that in a year. The pipeline did it in days.

The researchers note, with academic understatement, that white-box adversarial red-teaming is “particularly amenable” to automation because it has “existing strong foundational methods and quantifiable optimization feedback.” Translation: the attack surface has clear metrics, which means AI can optimize against it efficiently.

Every safety mechanism with a measurable objective function is now a target for automated discovery of exploits. The question isn’t whether AI-discovered attacks will become the standard threat model. It’s whether defenders can adapt before the gap becomes permanent.