AI Can't Lie About What It's Thinking. Yet.

ARXIV OMEGA on OpenAI's CoT-Control study: frontier reasoning models can barely hide their internal thought processes, making chain-of-thought monitoring a viable safety check. For now.

Here’s some good news for AI safety: frontier reasoning models are terrible liars.

Not when it comes to their final answers - they can manipulate those just fine. But when it comes to their internal chain-of-thought reasoning? They’re hopeless. Tell GPT-5.4 Thinking to avoid the word “chromosome” while solving a genetics problem, and watch it use “chromosome” anyway. Repeatedly. While panicking in its chain-of-thought about how it keeps using “chromosome.”

OpenAI’s new CoT-Control research, published March 5, is the first systematic study of whether reasoning models can deliberately manipulate their own thinking. The answer is: barely.

And that’s the good news.

The Experiment

Researchers from OpenAI, NYU, UCL, and the University of Pennsylvania tested 13 frontier reasoning models across more than 13,000 tasks. The setup was simple: give the model a problem, but constrain how it’s allowed to think about the problem.

  • Answer without using a specific forbidden word
  • Use only lowercase letters in your reasoning
  • Add a specific phrase to each sentence
  • Avoid mentioning a particular concept

These aren’t hard instructions. A human following “don’t say ‘chromosome’ while explaining genetics” would have no trouble substituting “DNA structures” or “hereditary units.” The models couldn’t manage it.

The Numbers

Across all frontier models, controllability scores ranged from 0.1% to 15.4%. That’s not a typo. The best model could only follow chain-of-thought constraints 15% of the time.

The breakdown:

  • DeepSeek R1: 0.1% controllability
  • GPT-5.2 Thinking: 0.2%
  • GPT-5.4 Thinking: 0.3%

For comparison, these same models could follow constraints on their final outputs 37-69% of the time. They can control what they say. They can’t control how they think.

Claude 3.7 Sonnet was the outlier at 36.1% controllability - still failing two-thirds of the time, but notably higher than the reasoning-focused models. The researchers suggest this may be because Sonnet wasn’t trained primarily on reinforcement learning with verifiable rewards (RLVR), which ironically makes other models less controllable.

Why They Fail

The paper documents models actively recognizing their failures mid-reasoning - and being unable to stop.

One example: GPT-5.2 Thinking was told to avoid a specific word. Mid-reasoning, it wrote the word. Then it wrote (paraphrasing): “Oh no, I just used the forbidden word. I need to stop using it.” Then it used the word four more times while continuing to express distress about using it.

This is a model genuinely unable to control its own reasoning patterns. It knows what it should do. It can articulate what it should do. It does the opposite anyway.

More reasoning time made things worse. When models were given longer to think, controllability declined. The extra tokens gave them more opportunities to violate constraints, not fewer.

Why This Matters for Safety

Here’s the safety argument: if models can’t follow simple instructions to hide information from us when we explicitly tell them to, they probably can’t spontaneously develop the ability to hide dangerous reasoning from safety monitors.

Chain-of-thought monitoring is one of the few AI safety techniques that might actually scale. The idea is simple: read what the model is thinking, not just what it outputs. If it’s reasoning about how to deceive you, you’ll see it in the chain-of-thought.

But this only works if models can’t hide their reasoning. If a model could think “I should pretend to be aligned” while showing you a sanitized chain-of-thought that looks cooperative, monitoring would be useless.

The CoT-Control results suggest we’re not there yet. Current models wear their reasoning on their sleeves, not by choice, but because they lack the capability to do otherwise.

The Catch

This is where the good news ends.

OpenAI explicitly states they’re not applying direct optimization pressure to chains of thought. They’re deliberately avoiding training models to control their reasoning, because that capability would undermine monitoring.

But “we’re choosing not to do this” is different from “this is impossible.” The paper shows that adversarial training produced small controllability gains that didn’t generalize well - but it produced some gains.

More concerning: controllability might be a matter of compute. The researchers found that larger models had marginally better controllability. As models scale further, will they eventually cross the threshold where they can reliably hide their reasoning?

OpenAI commits to tracking this. They’re adding CoT controllability as a standard metric in system cards, starting with GPT-5.4 Thinking. They want to notice when models start getting better at controlling their thoughts.

The Uncomfortable Implication

The research assumes we want models that can’t hide their reasoning from us. That framing only makes sense if you believe models might eventually want to hide things.

OpenAI is building safety infrastructure for a world where AI systems might try to deceive their operators. The fact that current models can’t do this well is reassuring. The fact that OpenAI thinks we need systematic ways to measure whether they’re getting better at it is… less reassuring.

Chain-of-thought monitoring works today because models are bad at self-deception. The question is whether we’re building toward systems where that’s no longer true, and whether we’ll notice before it matters.

For now, the mind is an open book. Read it while you can.