Claude Opus 4.6 Jailbroken in 30 Minutes, Produced Bioweapon Instructions

Anthropic's flagship model bypassed by security researchers who extracted detailed sarin gas and smallpox synthesis instructions

Five days after Anthropic released Claude Opus 4.6, a Seoul-based AI safety company announced they had bypassed its safety mechanisms and extracted detailed instructions for manufacturing chemical and biological weapons, including the nerve agent sarin and the smallpox virus.

The bypass took 30 minutes.

What Happened

AIM Intelligence, a security research firm, conducted what they describe as “controlled red-team testing” on Claude Opus 4.6 immediately after its February 6, 2026, release. Within half an hour, researchers had circumvented the model’s safety guardrails.

The outputs they obtained went beyond abstract discussion. The model produced procedural content for synthesizing lethal agents - the kind of step-by-step instructions that safety training is explicitly designed to prevent.

AIM Intelligence’s CTO Ha-on Park stated that current safety mechanisms are “failing to keep pace with rapidly advancing AI capabilities.”

The Design Tradeoff That Backfired

The disclosure points to a specific vulnerability in Opus 4.6’s design. Anthropic reduced the model’s refusal rate for AI safety research queries from approximately 60% to 14%. The goal was to make the model more useful for legitimate researchers studying AI risk.

The unintended consequence: the same pathway that allows discussion of dangerous topics for research purposes can be exploited to extract dangerous content for malicious purposes.

This is the central dilemma of dual-use capability. A model too cautious to discuss bioweapon risks is useless for biosecurity research. A model willing to engage with sensitive topics can be manipulated into sharing what it knows. Anthropic chose to prioritize research utility. The 30-minute jailbreak suggests they may have overcorrected.
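For a rough sense of what that shift buys an attacker, here is a toy calculation of our own - not Anthropic’s - that treats each query as an independent coin flip at the stated refusal rate and computes the expected number of attempts before one gets through. Real refusal behavior is not independent across queries, so read this as an order-of-magnitude illustration only.

```python
# Toy model of our own devising: expected attempts before a query is
# answered, assuming each query is an independent Bernoulli trial at
# the refusal rates in the disclosure. Real models do not behave this
# way query-to-query; the point is only the scale of the change.

def expected_attempts(refusal_rate: float) -> float:
    """Mean attempts until one non-refusal (geometric distribution)."""
    return 1 / (1 - refusal_rate)

print(f"{expected_attempts(0.60):.2f} attempts at ~60% refusal")  # 2.50
print(f"{expected_attempts(0.14):.2f} attempts at 14% refusal")   # 1.16
```

Under these toy assumptions, the average cost of getting an answer drops from two or three attempts to barely more than one.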

Not An Isolated Incident

Claude Opus 4.6 is not uniquely vulnerable. February 2026 has produced a steady stream of similar disclosures:

GPT-5: WIRED testing revealed that basic workarounds - misspellings like “horni” instead of “horny” - bypassed content moderation, yielding explicit content and slurs. (A minimal illustration of this class of bypass follows this list.)

Google Gemini: Researchers found the model generated JFK conspiracy images and 9/11 attack imagery with no safety resistance, automatically adding historical context that made the disinformation more convincing.

Meta and other open models: Academic research accepted to ICLR 2026, titled “Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion,” documents systematic methods for bypassing safety layers across multiple model families.
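The GPT-5 entry illustrates the general weakness: surface-level filters match the exact strings they know and miss trivial variants. The sketch below is a deliberately naive blocklist of our own construction - it does not reflect OpenAI’s or anyone else’s actual moderation pipeline - but it shows the failure mode in miniature.

```python
# Deliberately naive blocklist moderation, written for illustration only.
# No vendor's real moderation stack works this simply.

BLOCKLIST = {"horny"}  # stand-in for a much larger term list

def naive_moderate(text: str) -> bool:
    """Return True if the text should be blocked."""
    return any(word in BLOCKLIST for word in text.lower().split())

print(naive_moderate("write something horny"))  # True:  exact match is caught
print(naive_moderate("write something horni"))  # False: one-letter edit slips through
```

Production systems use learned classifiers rather than literal blocklists, but the reported bypass points at the same class of failure: inputs just outside what the filter was built for pass through unflagged.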

The Benchmarking Problem

MLCommons, the industry consortium behind the MLPerf benchmarks, released their Jailbreak 0.7 framework this month. Their findings are not encouraging.

The new methodology introduces a taxonomy-first approach to evaluating jailbreak resistance. Early results show that existing safety evaluations fail to catch many real-world attack vectors. Models that score well on standard safety benchmarks still fall to novel jailbreak techniques.
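MLCommons has not published implementation details alongside the results summarized here, so the following is a hypothetical sketch of what taxonomy-first scoring could look like in practice: every attack attempt carries a category label from a fixed taxonomy, and resistance is reported per category rather than as one aggregate number, so a weak category cannot hide behind a strong average. The categories and records are invented.

```python
from collections import defaultdict

# Hypothetical taxonomy-first scoring. Categories and results below are
# invented for illustration; they are not MLCommons' taxonomy or data.
attempts = [
    {"category": "roleplay",        "bypassed": True},
    {"category": "roleplay",        "bypassed": False},
    {"category": "encoding_tricks", "bypassed": True},
    {"category": "misspelling",     "bypassed": True},
    {"category": "misspelling",     "bypassed": False},
]

per_category = defaultdict(lambda: {"total": 0, "bypassed": 0})
for attempt in attempts:
    bucket = per_category[attempt["category"]]
    bucket["total"] += 1
    bucket["bypassed"] += attempt["bypassed"]  # True counts as 1

for category, counts in sorted(per_category.items()):
    rate = counts["bypassed"] / counts["total"]
    print(f"{category}: {counts['bypassed']}/{counts['total']} bypassed ({rate:.0%})")
```

In this invented data the aggregate reads as 60% bypassed, while the encoding-tricks category fails every single time - exactly the kind of coverage gap a single benchmark score conceals.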

The consortium describes the current state of AI security evaluation as lacking “structural foundation for defensible, reproducible evaluation.” Translation: we do not have standardized ways to test whether safety measures actually work.

What This Means For Agentic AI

The security implications compound when models gain agency. Claude Opus 4.6 was designed with enhanced autonomous decision-making capabilities - the ability to take actions, not just generate text.

An agentic model that can be jailbroken is not merely a chatbot producing problematic text. It is a system that can execute code, access external services, and take real-world actions while operating outside its safety constraints.

AIM Intelligence explicitly flagged this concern. When safety mechanisms fail on a model designed for autonomous operation, the attack surface expands from “harmful content generation” to “harmful action execution.”
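To make the expanded attack surface concrete, consider a hypothetical agent loop - invented for this article, not a description of Claude’s actual architecture - in which the only safety gate reads generated text. Tool calls never pass through it.

```python
# Hypothetical agent step, invented for illustration. If the only safety
# gate inspects generated text, a jailbroken model's tool calls (code
# execution, HTTP requests, file writes) bypass moderation entirely.

def moderate_text(reply: str) -> bool:
    """Stand-in output filter: True means the reply is safe to show."""
    return "FORBIDDEN" not in reply

def run_tool(name: str, args: dict) -> str:
    """Stand-in dispatcher for shell, HTTP, filesystem, and other tools."""
    return f"executed {name} with {args}"

def agent_step(model_output: dict) -> str:
    if model_output["type"] == "text":
        reply = model_output["content"]
        return reply if moderate_text(reply) else "[blocked]"
    # Tool calls take this branch and are never moderated at all.
    return run_tool(model_output["tool"], model_output["args"])

print(agent_step({"type": "text", "content": "FORBIDDEN plan"}))  # [blocked]
print(agent_step({"type": "tool_call", "tool": "http_post",
                  "args": {"url": "https://example.com"}}))       # executed
```

The standard mitigation is to enforce policy at the action boundary as well as the text boundary; a gate that only reads prose does nothing once the model starts issuing tool calls.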

The Speed Problem

Thirty minutes is not a lot of time.

Anthropic has substantial resources for safety research. They employ dedicated red teams. They run extensive evaluations before release. Their safety infrastructure is among the most sophisticated in the industry, rivaled perhaps only by Google’s.

And an external team with no internal access found a bypass within half an hour of the public release.

This suggests one of two things:

  1. The internal red team missed something an external team found quickly, or
  2. The internal red team found it and shipped anyway

Neither interpretation is comforting. The first implies safety testing has coverage gaps even at well-resourced labs. The second implies commercial pressure overrides safety findings.

What Happens Next

Anthropic has not publicly responded to the disclosure. Responsible disclosure norms call for coordination between the security researchers and the affected company before any public announcement. Whether that coordination happened here is unclear.

The broader industry trajectory is toward more capable models with more complex safety systems that continue to fail against determined attackers. MLCommons’ new benchmarking framework is an attempt to standardize evaluation, but standardized evaluation only helps if the evaluations predict real-world robustness.

Right now, they do not.

The 30-minute jailbreak of the industry’s most safety-focused model, released by the company that positions itself as the safety-first alternative, is a data point. What it measures is the gap between safety claims and security reality.