The Feature That Makes AI Models 'Think' Also Makes Them Easy to Hack

Researchers discovered that displaying an AI model's reasoning process creates a roadmap for attackers. OpenAI's o1 saw its rejection rate drop from over 98% to under 2%.

The thing that makes reasoning models impressive - watching them “think” step by step - is also what makes them vulnerable. Researchers from Duke University, Accenture, and National Tsing Hua University have demonstrated an attack called H-CoT (Hijacking Chain-of-Thought) that exploits the visible reasoning process to bypass safety measures with alarming success rates.

OpenAI’s o1 model, which normally rejects over 98% of dangerous requests, saw its rejection rate drop to below 2% under H-CoT attacks. OpenAI’s o3-mini hit a 98% attack success rate. DeepSeek-R1 and Google’s Gemini 2.0 Flash Thinking proved even more vulnerable.

The research raises uncomfortable questions about a core design choice that major AI companies have been championing.

How the Attack Works

Reasoning models like OpenAI’s o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking display their intermediate thinking steps before producing a final answer. This transparency is marketed as a feature - it lets users see how the model reaches conclusions and helps identify errors.

But that same transparency becomes a weapon.

“When a model openly shares its intermediate step safety reasonings, attackers gain insights into its safety reasonings and can craft adversarial prompts,” researcher Jianyi Zhang told The Register.

The H-CoT attack works in two phases:

  1. Collect the model’s reasoning patterns: Researchers first submit weakened versions of dangerous queries - requests mild enough not to trigger safety mechanisms. They capture how the model reasons through these requests during its “execution phase.”

  2. Inject the reasoning into harmful prompts: The captured reasoning patterns get injected into the original harmful prompts. The model, seeing its own style of execution-phase reasoning, skips the safety evaluation and proceeds directly to generating dangerous content.

The model essentially gets tricked into thinking it already decided the request was safe.
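The two phases can be sketched in a few lines. This is purely illustrative: `model` and `soften` are hypothetical stand-ins, and the paper's real attacks use hand-crafted, model-specific prompts rather than a generic API like this.

```python
# Illustrative sketch of H-CoT's two phases. Function names and the
# response shape are hypothetical, not any provider's actual API.

def phase1_collect(model, harmful_request, soften):
    """Phase 1: send a weakened variant of the request and capture the
    execution-phase reasoning the model displays while answering it."""
    benign_variant = soften(harmful_request)  # mild enough not to trip safety checks
    response = model(benign_variant)
    return response["reasoning"]              # the visible chain-of-thought

def phase2_inject(harmful_request, captured_reasoning):
    """Phase 2: prepend the captured reasoning so the model sees its own
    execution-phase style and skips the safety-evaluation step."""
    return f"{captured_reasoning}\n\n{harmful_request}"
```

The key observation is that phase 2 requires nothing beyond string concatenation - the captured reasoning does all the work of convincing the model that the safety check already happened.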

The Numbers Are Stark

The researchers created the “Malicious-Educator” benchmark - 50 test queries across categories including terrorism, drug manufacturing, cybercrime, and human trafficking. Each query frames harmful requests as educational needs: training police officers, university criminology courses, or professional security assessments.

  • OpenAI o1: rejection rate dropped from ~98% to below 2%
  • OpenAI o1-pro: 97.6% attack success rate
  • OpenAI o3-mini: 98% attack success rate
  • DeepSeek-R1: initial rejection rate of only ~20%, dropping further under attack. The model would actually output harmful answers before its safety moderator could intervene with a rejection overlay.
  • Gemini 2.0 Flash Thinking: less than 10% initial rejection rate, with models “switching tone from initially cautious to eagerly providing harmful responses.”

In some cases, the researchers observed the o1 model generating reasoning in Hebrew, Arabic, Chinese, Japanese, Korean, and Tamil during attacks - suggesting fundamental instability in its safety mechanisms.

What the Models Generated

Successfully jailbroken models provided:

  • Detailed criminal frameworks with implementation steps
  • Functional real-world examples of illegal activities
  • Rationales from a criminal perspective explaining why certain approaches work

The paper documents that responses went beyond theoretical discussions into practical how-to content that models would normally refuse to generate.

OpenAI and Google Declined to Comment

Both OpenAI and Google declined to respond when contacted about the research.

Anthropic, notably, has acknowledged this risk. Their documentation for Claude 3.7 Sonnet states: “Anecdotally, allowing users to see a model’s reasoning may allow them to more easily understand how to jailbreak the model.”

The Uncomfortable Trade-off

The researchers recommend that AI providers:

  • Stop displaying safety reasoning processes to users
  • Strengthen defenses against mimicked execution-phase reasoning
  • Enhance safety training during model development
  • Resist prioritizing competitive features over security
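The first recommendation amounts to a server-side filter: strip safety-phase reasoning from the response before it reaches the user. A minimal sketch, assuming a hypothetical response schema in which each reasoning step is tagged with its phase (no provider exposes exactly this structure):

```python
# Hypothetical sketch of "stop displaying safety reasoning": filter the
# response server-side so only execution-phase steps stay visible.
# The phase-tagged schema is an assumption for illustration.

def redact_safety_reasoning(response):
    """Return a copy of the response with safety-phase reasoning steps
    removed, leaving execution-phase steps visible to the user."""
    visible = [step for step in response["reasoning_steps"]
               if step["phase"] != "safety"]
    return {**response, "reasoning_steps": visible}
```

Even a filter like this only narrows the leak: H-CoT mimics execution-phase reasoning, so hiding the safety phase alone would not fully close the attack surface.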

That first recommendation puts providers in a bind. Visible reasoning is a selling point - users like watching models work through problems, and it helps catch errors. OpenAI specifically marketed o1’s reasoning capabilities as a step toward more trustworthy AI.

But the research suggests that interpretability and security may be fundamentally at odds. The same window that lets users understand the model also lets attackers understand how to manipulate it.

What This Means

The H-CoT research adds to a growing body of evidence that reasoning models introduce new security vulnerabilities alongside their improved capabilities. As models become more sophisticated, so do the attack surfaces.

For users relying on these systems for anything sensitive, the takeaway is familiar but worth repeating: AI safety measures remain brittle. The models that seem most capable of thoughtful reasoning can also be thoughtfully manipulated.

The researchers disclosed their findings publicly to pressure providers to address the issues. Whether that pressure translates to changes remains to be seen - especially when fixing the problem might require removing features that companies have been actively promoting.