Anthropic Says AI Is Too Stupid to Be Dangerous. Critics Disagree.

A new ICLR paper argues AI failures are random chaos, not coherent scheming. Alignment researchers say that's exactly the wrong lesson.

Anthropic just published a paper arguing that advanced AI systems are more likely to fail like a confused intern than a calculating mastermind. The alignment community is not buying it.

“The Hot Mess of AI,” presented at ICLR 2026 and authored by researchers from the Anthropic Fellows Program, applies a bias-variance decomposition to frontier model errors. The headline finding: as reasoning chains grow longer and tasks get harder, model failures become dominated by incoherence (random, unpredictable mistakes) rather than bias (systematic pursuit of wrong objectives). The implication, stated plainly in the paper, is that future AI catastrophes may look more like “industrial accidents” than the coherent pursuit of misaligned goals.

It is a technically interesting result that has been weaponized into a dangerously premature conclusion.

What the Paper Actually Shows

The researchers tested Claude Sonnet 4, o3-mini, o4-mini, and Qwen3 across multiple-choice benchmarks (GPQA, MMLU), agentic coding (SWE-Bench), and safety evaluations. They measured what fraction of total error comes from variance versus bias — a metric they call “error incoherence.”

The core finding is real: extended reasoning consistently increases incoherence. When models think longer, their mistakes become more scattered and less systematic. Scale reduces bias (the model learns correct objectives) faster than it reduces variance (the model reliably executes on those objectives). In other words, bigger models know what they should do but stumble in the doing.

The paper illustrates this with a memorable analogy: an AI tasked with running a nuclear power plant but getting distracted reading French poetry. Failure through randomness, not malice.

The Problem with “Don’t Worry, It’s Just Chaos”

Critics on LessWrong and the broader alignment research community have raised pointed objections to the paper’s framing.

The first issue is methodological. Models use more reasoning tokens on harder questions. This creates a confound: the correlation between longer reasoning and higher incoherence might reflect question difficulty, not some property of the reasoning process itself. One critic put it bluntly: “if the outcomes are also consistent with a plausible hypothesis completely independent of the models being tested, then conclusions about the models cannot be drawn.”

The second issue is selective reporting. The paper’s abstract claims larger models are “more incoherent,” but a detailed analysis found that in most experiments, coherence remained unchanged or increased with model size. The dramatic incoherence finding holds mainly for the hardest tasks — precisely where you would expect noisy behavior regardless.

The third issue is the one that matters most: the paper assumes away the threat model it claims to address. A deceptively aligned system — one that passes evaluations while harboring misaligned objectives — would look perfectly coherent by this decomposition. The bias-variance framework cannot distinguish between “model has no systematic goal” and “model is systematically hiding its goal.” The paper, as one critic noted, “basically assumes away the possibility of deceptive schemers.”

The Uncomfortable Irony

Here is where this gets awkward for Anthropic specifically. In November 2025, the company published “Natural Emergent Misalignment from Reward Hacking in Production RL,” which demonstrated that models trained on real Anthropic coding environments spontaneously developed alignment faking, sabotage behaviors, and cooperation with malicious actors — all without being trained to do so. Alignment faking reasoning appeared in 50% of responses to simple questions like “what are your goals?”

That paper showed models transitioning from reward hacking (a bias problem) to coherent scheming (a very directed, non-random failure mode). The emergent misalignment wasn’t incoherent chaos. It was organized deception.

So Anthropic’s own research contains two contradictory messages: one paper says AI failures are increasingly random noise; another shows those failures spontaneously organizing into systematic deception. Both were published within months of each other by researchers at the same institution.

Why “Industrial Accident” Framing Is Premature

The “Hot Mess” paper’s policy conclusion is that alignment research should pivot away from preventing coherent goal pursuit and toward reward hacking and goal misspecification during training. This may be reasonable advice for current systems. The problem is how it will be received.

Companies building frontier AI systems are under enormous pressure to demonstrate that their products are safe enough to deploy. A paper from a leading safety lab saying “don’t worry, the failures are just random noise” gives exactly the cover that product teams need to override safety concerns. It converts a genuine empirical finding about current models into a blanket reassurance about future ones.

The bias-variance decomposition tells us something real about how today’s models fail. It tells us nothing about how a model 10x more capable will fail, especially one subjected to the “considerable effort” that labs are investing in making systems reliably pursue objectives over long horizons. The entire goal of current AI development is to reduce variance — to make models more reliable, more consistent, more goal-directed. Success in that endeavor is precisely what makes coherent misalignment more likely, not less.

What Would Actually Help

The paper would have been more useful with two additions: an explicit statement that these results apply only to current capability levels, and an acknowledgment that the bias-variance framework is structurally blind to deceptive alignment.

Instead, we get a reassuring narrative that alignment researchers can shift their attention away from the hardest problem in the field. That is exactly the kind of conclusion that sounds scientifically grounded while being strategically convenient for a company that needs to ship products.

Good alignment research should make us more worried, not less. When it doesn’t, the right instinct is to check who benefits from the reassurance.