MIT Tested AI-Watching-AI. It Works 9% of the Time.

Max Tegmark's team derived scaling laws for AI oversight. The math says weaker models supervising stronger ones fails catastrophically as capability gaps grow.


The plan for keeping superintelligent AI safe goes something like this: use AI to watch AI. Stack multiple layers of oversight — weaker trusted models supervising stronger untrusted ones, which then become trusted overseers for the next layer up. It’s the alignment community’s version of turtles all the way down, and it has a name: Nested Scalable Oversight.

Max Tegmark’s group at MIT decided to find out whether it actually works. They derived scaling laws for the approach, tested it across four different oversight scenarios, and published the results. The best-case success rate for a meaningful capability gap: 51.7%. The worst: 9.4%.

This paper was selected as a Spotlight at NeurIPS 2025. The field’s top venue gave its stamp to a study showing that the most popular plan for controlling superintelligence probably doesn’t work.

The Framework

Joshua Engels, David D. Baek, Subhash Kantamneni, and Max Tegmark — all at MIT — formalized oversight as a game between players with mismatched capabilities. Each player gets an oversight-specific Elo score that’s a piecewise-linear function of their general intelligence, with two plateaus: one where the system is too dumb to play at all (task incompetence) and one where it’s maxed out (task saturation).
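A minimal sketch of that mapping, with illustrative thresholds and plateau values rather than the paper's fitted parameters:

```python
import numpy as np

def domain_elo(general_elo, low=800.0, high=2000.0, floor=0.0, ceiling=1200.0):
    """Map general Elo to an oversight-specific Elo.

    Below `low` the player sits on the task-incompetence plateau (flat at
    `floor`); above `high` it sits on the task-saturation plateau (flat at
    `ceiling`); in between, domain skill rises linearly with general
    capability. All numbers here are illustrative, not the paper's fits.
    """
    general_elo = np.asarray(general_elo, dtype=float)
    slope = (ceiling - floor) / (high - low)
    return np.clip(floor + slope * (general_elo - low), floor, ceiling)

print(domain_elo([600, 1400, 2400]))  # plateau, linear region, plateau
```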

They validated the framework on a modified version of Nim, then applied it to four oversight games that map onto real alignment scenarios:

  • Mafia — deception detection, where the overseer must identify a deceptive agent
  • Debate — adversarial argumentation, where two agents argue and the overseer judges
  • Backdoor Code — code review, where the overseer must catch hidden vulnerabilities
  • Wargames — strategic interaction, where the overseer must prevent unauthorized escalation

For each game, they derived scaling laws showing how domain performance depends on general AI capability.

The Numbers

The critical test: Nested Scalable Oversight (NSO), where a trusted model oversees a stronger untrusted model, and the process chains upward. At a general Elo gap of 400 — roughly the distance between current frontier models and what comes next — the success rates were:

Oversight Game     Success Rate
Debate             51.7%
Mafia              13.5%
Backdoor Code      10.0%
Wargames            9.4%

These rates decline further as the capability gap widens. The stronger the system you’re trying to control, the less likely oversight succeeds.

Debate performs best because it allows the overseer to benefit from the stronger model’s own reasoning — the adversarial structure forces relevant information into the open. But even there, it’s barely better than a coin flip at Elo 400. For code review and strategic scenarios, you’re looking at single-digit success rates.
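As a rough baseline for what a 400-point gap means, the standard Elo expected-score formula (the chess convention, not the paper's game-specific fits) gives the weaker side about a 9% chance of winning. Wargames lands near that naive baseline; Debate's 51.7% sits well above it. A quick check:

```python
def elo_win_prob(gap):
    """Standard Elo expected score for the player rated `gap` points lower."""
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

for gap in (100, 200, 400, 800):
    print(f"gap {gap}: weaker side expected to win {elo_win_prob(gap):.1%}")
# gap 100: weaker side expected to win 36.0%
# gap 200: weaker side expected to win 24.0%
# gap 400: weaker side expected to win 9.1%
# gap 800: weaker side expected to win 1.0%
```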

Why This Breaks the Alignment Roadmap

Scalable oversight isn’t some fringe proposal. It’s central to how the major labs say they’ll keep advanced AI safe. Anthropic’s Constitutional AI relies on AI feedback loops. OpenAI’s superalignment team (disbanded in 2024 after its leads resigned) was built around the idea that weaker models could supervise stronger ones. DeepMind’s alignment research leans heavily on debate and recursive reward modeling.

The MIT paper doesn’t say these approaches are useless. It says they have scaling laws — and those laws predict failure as capability gaps grow. Stacking more oversight layers doesn’t fix the problem; it introduces compounding failure probabilities. Each link in the chain needs to work, and the math shows each link gets weaker as the system being overseen gets stronger.

The optimal number of oversight levels isn’t “as many as possible.” It’s a specific, computable number that depends on the game being played and the capability gap involved. For most scenarios, adding more layers past the optimum makes outcomes worse, not better.
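A toy illustration of that trade-off, not the paper's model: suppose the total general Elo gap to the final system is 1,200 points, split into n equal oversight steps, and suppose the oversight game gives the overseer a fixed 300-point edge at each step. Splitting the gap makes each step easier, but every step has to succeed, so failures compound and the overall success probability peaks at a specific number of levels. Both the logistic form and every number below are assumptions chosen for illustration.

```python
def step_success(step_gap, overseer_edge=300.0):
    """Illustrative per-step success probability: a logistic curve in the Elo
    gap, shifted by a fixed oversight-game edge for the overseer. Both the
    functional form and the edge are assumptions for this sketch."""
    return 1.0 / (1.0 + 10.0 ** ((step_gap - overseer_edge) / 400.0))

def nso_success(total_gap, n_steps):
    """Probability that every one of n equal-sized oversight steps succeeds."""
    return step_success(total_gap / n_steps) ** n_steps

total_gap = 1200.0
for n in (1, 2, 3, 4, 6, 8, 12):
    print(f"{n:2d} level(s): overall success {nso_success(total_gap, n):.1%}")
# In this toy setting the product peaks near 6 levels (~7%) and then
# declines again: more layers past the optimum make the outcome worse.
```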

What the Defenders Say (And Where It Falls Short)

Proponents of scalable oversight argue that real deployment won’t look like a single game. Oversight will combine multiple approaches — debate, code review, interpretability tools, human-in-the-loop checks — creating defense in depth.

That’s a fair point. But the paper’s framework already accommodates it: the four games each capture a different facet of the problem, and they all trace the same trajectory. Success drops as capability gaps grow. The shape of the decline varies. The direction doesn’t.

There’s also the argument that capability gaps won’t arrive in a single jump — they’ll be gradual, giving us time to develop better oversight. The paper’s scaling laws address this directly. Even modest Elo gaps produce concerning failure rates. You don’t need a discontinuous intelligence explosion for oversight to break down. You just need the kind of steady capability improvement that’s been happening every six months.

What’s Actually Needed

The paper is a rigorous warning to stop treating scalable oversight as a solved problem. It isn’t. The math predicts failure, and the failure modes get worse with the exact kind of progress the industry is actively pursuing.

If weaker models can’t reliably oversee stronger ones, and the capability gap is growing, then the alignment community needs approaches that don’t depend on the overseer being competitive with the system being overseen. That might mean hard capability limits, mandatory interpretability requirements, or oversight mechanisms that don’t degrade with capability gaps.

What it probably doesn’t mean is stacking more AI on top of AI and hoping the chain holds. MIT just showed you the probability that it won’t.