There’s a persistent assumption in AI reasoning research: more tokens equals more thinking. Chain-of-thought prompting, reasoning traces, “thinking” models - they all bet that if you let the model write more, it’ll get the right answer.
New research from Google and the University of Virginia says that assumption is backwards. Token count doesn’t just fail to predict accuracy - it actively predicts failure.
The Numbers Are Damning
The paper, “Think Deep, Not Just Long,” tested multiple reasoning models across competition-level math benchmarks (AIME 2024/2025, HMMT 2025) and graduate-level science questions (GPQA-Diamond). The correlation between token count and accuracy: r = -0.59.
That’s not a weak signal. That’s a strong negative correlation. The more a model writes, the more likely it’s wrong.
The researchers tested this across several models: GPT-OSS-20B and 120B, DeepSeek-R1-70B, and Qwen3-30B-Thinking. The pattern held everywhere.
What Actually Predicts Accuracy
Instead of counting tokens, the team proposed measuring deep-thinking tokens - tokens where the model’s internal predictions shift significantly across its deeper layers before stabilizing.
The intuition: when a model is genuinely reasoning through a problem, its internal representations undergo substantial revision in the later layers. When it’s just padding output or going in circles, those revisions don’t happen.
To identify deep-thinking tokens, the researchers tracked Jensen-Shannon divergence between intermediate-layer and final-layer predictions. Tokens that settle only in the deepest 15% of layers get flagged as “deep-thinking.” The proportion of these tokens in a response is the Deep-Thinking Ratio (DTR).
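The per-token check can be sketched in a few lines of NumPy. This is an illustration, not the paper's code: the 0.1 settling threshold is an assumed value, and `layer_probs` stands in for the per-layer next-token distributions you would obtain from something like a logit lens on an open-weight model.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def deep_thinking_ratio(layer_probs, depth_cutoff=0.85, settle_threshold=0.1):
    """layer_probs: shape (num_tokens, num_layers, vocab_size), each row a
    per-layer next-token distribution. A token counts as 'deep-thinking' if
    its prediction at the cutoff layer (start of the deepest 15% of layers)
    still diverges from the final-layer prediction, i.e. it settles late."""
    num_tokens, num_layers, _ = layer_probs.shape
    cutoff = int(num_layers * depth_cutoff)
    deep = 0
    for t in range(num_tokens):
        final = layer_probs[t, -1]
        # Large divergence here means the token is still being revised
        # in the deepest layers rather than settled early.
        if js_divergence(layer_probs[t, cutoff], final) > settle_threshold:
            deep += 1
    return deep / num_tokens
```

A token whose distribution is identical at the cutoff and final layers contributes nothing; one that flips its prediction in the last layers is flagged, and the DTR is simply the flagged fraction.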
The correlation between DTR and accuracy: r = 0.68. Not only positive, but larger in magnitude than the negative correlation with length.
Practical Impact: Half the Cost, Same Accuracy
The researchers built this insight into a sampling strategy called Think@n. Instead of generating multiple responses and voting (the standard self-consistency approach), Think@n scores responses by their DTR and selects the highest-scoring ones.
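In outline, Think@n is best-of-n selection with DTR as the scoring function. A minimal sketch, where `generate` and `score_dtr` are assumed callables rather than anything the paper specifies:

```python
def think_at_n(generate, score_dtr, prompt, n=8):
    """Generate n candidate responses and return the one with the highest
    Deep-Thinking Ratio, instead of majority voting over final answers.
    `generate` produces one response per call; `score_dtr` maps a
    response to its DTR score (both hypothetical interfaces)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score_dtr)
```

The contrast with self-consistency is that no answer extraction or voting is needed: selection happens on an internal signal of the generation itself.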
The results on AIME 2025:
- 94.7% accuracy using DTR selection
- 92.7% accuracy using standard voting
- ~50% reduction in inference cost
The cost savings come from an additional trick: DTR can be estimated from just the first 50 tokens. If a response starts with low deep-thinking activity, you can reject it early without waiting for the full generation.
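The early-exit logic might look like the following sketch. The function names and the 0.2 rejection threshold are illustrative assumptions; only the 50-token prefix length comes from the paper.

```python
def early_accept(stream_tokens, score_prefix_dtr, prefix_len=50, threshold=0.2):
    """Stream a response token by token; once `prefix_len` tokens have
    arrived, estimate DTR on the prefix and abort generation if it falls
    below `threshold`. Returns the full token list, or None on rejection."""
    tokens = []
    for tok in stream_tokens:
        tokens.append(tok)
        if len(tokens) == prefix_len:
            if score_prefix_dtr(tokens) < threshold:
                return None  # reject early, saving the rest of the generation
    return tokens  # response survived the prefix check
```

Rejected generations cost only 50 tokens instead of a full reasoning trace, which is where the bulk of the reported savings would come from.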
Why Models Ramble When They’re Wrong
The paper includes heatmap visualizations showing how different tokens settle at different depths. Functional words (articles, prepositions) settle in shallow layers - the model knows these immediately. Answer tokens, by contrast, undergo continued revision in deeper layers.
When a model produces a long, wandering response, it’s often because the early layers are generating plausible text without the deeper layers converging on an actual answer. The model keeps writing because it hasn’t settled on anything.
This matches what many practitioners observe: wrong answers often come with excessive hedging, restarts, or circular reasoning. The model is essentially stalling.
Implications for “Thinking” Models
This research has uncomfortable implications for the current crop of reasoning models. OpenAI’s o1 and o3, DeepSeek-R1, Anthropic’s extended thinking modes - they all encourage longer reasoning traces.
The paper doesn’t claim these models are useless. But it suggests that evaluating them purely by output length is misguided. A model that thinks for 5,000 tokens isn’t necessarily doing more useful work than one that thinks for 500. What matters is whether those tokens represent genuine internal computation or just verbose meandering.
The DTR metric offers a way to distinguish the two. And it can be computed without retraining or modifying the model - you only need the intermediate-layer token distributions, which open-weight models expose through their hidden states.
What This Means
The AI industry has spent billions on inference scaling - throwing more compute at generation time to improve quality. This research suggests much of that compute is wasted.
If you can identify low-quality reasoning in the first 50 tokens, you can reject it and try again. That’s roughly a 50% cost reduction with no accuracy loss - in the paper’s experiments, accuracy actually improved. Scale that across the billions of API calls happening daily, and you’re looking at enormous savings.
More fundamentally, this paper challenges how we think about AI reasoning. Length was always a proxy - we assumed more output meant more thought. Now we have evidence that it’s often the opposite: more output means the model is stuck.
The Bottom Line
Deep thinking and long thinking aren’t the same. The models that write the most are often the ones that understand the least. Measuring what’s happening inside the model - not just counting its output - may be the key to both better accuracy and lower costs.