Google released Gemini 3.1 Pro yesterday, and the benchmark numbers are striking. On ARC-AGI-2, a test measuring a model’s ability to solve novel logic problems it hasn’t seen before, the new model scored 77.1%. Its predecessor, Gemini 3 Pro, managed 31.1%. That’s not an incremental improvement - the score on that reasoning test more than doubled in a single generation.
The announcement comes as the AI industry approaches what some observers are calling “benchmark exhaustion.” When every major lab releases a model claiming record scores, the numbers start to blur together. What makes Gemini 3.1 Pro’s release notable isn’t just the scores themselves, but the economics: Google is offering this at $2 per million input tokens, the same price as the previous model it just made obsolete.
The Benchmark Picture
Gemini 3.1 Pro leads on 13 of 16 benchmarks that Google evaluated. The standout results include:
- ARC-AGI-2 (abstract reasoning): 77.1%, compared to Anthropic’s Claude Opus 4.6 at 68.8% and OpenAI’s GPT-5.2 at 52.9%
- GPQA Diamond (scientific knowledge): 94.3%, outperforming all listed competitors
- SWE-Bench Verified (agentic coding): 80.6%
- BrowseComp (web browsing tasks): 85.9%
- LiveCodeBench Pro: Elo score of 2,887
But benchmark dominance isn’t universal. Anthropic’s Claude Opus 4.6 still leads on Humanity’s Last Exam with tool support at 53.1%, and Gemini 3 Pro actually beats its successor on MMMU Pro (81.0% vs 80.5%) - a reminder that newer doesn’t always mean better across every dimension.
What’s Actually New
The model accepts up to 1 million tokens of input context and can produce up to 64,000 tokens of output. It’s natively multimodal, processing text, images, audio, video, and code repositories in a single context.
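A 1-million-token window is large enough for many whole repositories. As a back-of-the-envelope sketch (the ~4-characters-per-token ratio is a common rough heuristic, not an official figure, and the file sizes are illustrative):

```python
# Rough check of whether a project fits in Gemini 3.1 Pro's 1M-token
# input window, using the common ~4-characters-per-token heuristic.

INPUT_LIMIT_TOKENS = 1_000_000   # input context window
OUTPUT_LIMIT_TOKENS = 64_000     # maximum output length

def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Approximate token count from raw character count."""
    return int(char_count / chars_per_token)

def fits_in_context(total_chars: int, reserved_for_prompt: int = 2_000) -> bool:
    """True if the estimated tokens leave room for the prompt itself."""
    return estimate_tokens(total_chars) + reserved_for_prompt <= INPUT_LIMIT_TOKENS

# Example: a mid-sized repository of ~3 MB of source text.
repo_chars = 3_000_000
print(fits_in_context(repo_chars))  # ~750k estimated tokens -> True
```

By this estimate, a 3 MB codebase fits with room to spare, while anything past roughly 4 MB of raw text would need chunking.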
One genuinely new capability: Gemini 3.1 Pro can generate animated SVGs directly from text descriptions. These vector graphics remain sharp at any size with minimal file sizes - useful for developers building web interfaces.
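To see why that matters, here is a hand-written example of the kind of output involved - a pulsing circle in a few hundred bytes (this sample is illustrative, not model output):

```python
# A minimal animated SVG of the kind the model can emit: a circle whose
# radius pulses via a SMIL <animate> element. Vector output like this
# scales to any size and stays tiny on disk.

svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="20" fill="steelblue">
    <animate attributeName="r" values="20;35;20" dur="2s"
             repeatCount="indefinite"/>
  </circle>
</svg>"""

# Write it out so a browser can render the animation.
with open("pulse.svg", "w") as f:
    f.write(svg)

print(len(svg), "bytes")
```

An equivalent animated GIF would run to tens of kilobytes and blur when scaled; the SVG is resolution-independent.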
The model also introduces a three-tier “thinking level” system. Users can choose low, medium, or high computational effort, giving more control over the tradeoff between speed and reasoning depth. The previous generation only offered low and high options.
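In practice that choice surfaces as a request parameter. The sketch below assembles a request body by hand; the model id and the `thinking_level` field names follow Google’s published pattern for Gemini 3 but are assumptions here - check the current API reference before relying on them:

```python
# Sketch of selecting a thinking level for a Gemini API call.
# Field names and model id are assumed, not verified.

def build_request(prompt: str, thinking_level: str = "medium") -> dict:
    """Assemble a request body with one of the three effort tiers."""
    if thinking_level not in {"low", "medium", "high"}:
        raise ValueError(f"unknown thinking level: {thinking_level}")
    return {
        "model": "gemini-3.1-pro-preview",           # assumed model id
        "contents": [{"parts": [{"text": prompt}]}],
        "generation_config": {
            "thinking_config": {"thinking_level": thinking_level},
        },
    }

request = build_request("Prove that 17 is prime.", thinking_level="high")
print(request["generation_config"]["thinking_config"]["thinking_level"])
```

The point of the tiers is that "high" spends more inference-time compute on hard problems, while "low" keeps latency down for routine ones.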
The Price War Continues
At $2 per million input tokens (and $12 for output), Gemini 3.1 Pro undercuts Anthropic’s Opus pricing significantly. Combined with the benchmark improvements, this puts pressure on competitors to either match performance at similar prices or justify premium pricing with capabilities Google can’t match.
The pricing also reflects a strategic reality: Google can subsidize AI costs with advertising revenue in ways that pure-play AI companies cannot. Whether this leads to sustainable competition or eventual market consolidation remains an open question.
What This Means for Users
For developers and enterprises already using Google’s AI stack, the upgrade is essentially free. The model is available now in preview through the Gemini API, Google AI Studio, Vertex AI, and the Gemini app. NotebookLM access requires a Pro or Ultra subscription.
For everyone else, the practical question is whether benchmark improvements translate to real-world utility. A model that scores 77% instead of 68% on abstract reasoning tests might matter for some applications and not at all for others.
Privacy Considerations Remain
Google’s enterprise offerings - Vertex AI and Gemini for Workspace - don’t use customer data to train models. But the consumer Gemini app has different rules. Conversations may be stored for up to 36 months and reviewed by humans to improve model quality, and conversations that have been sampled for review can be retained for up to three years even after you delete them.
If you’re using Gemini through the free app or basic paid tier, assume your conversations aren’t private. For sensitive work, the enterprise products with their stricter data governance policies remain the safer choice.
The Bigger Picture
February 2026 has been intense for model releases. Anthropic dropped Claude Opus 4.6 earlier this month. Chinese labs have been shipping competitive models at aggressive prices. OpenAI is navigating post-GPT-5 refinements while pursuing record funding rounds.
Google’s response is characteristic: compete on price while matching or exceeding capabilities. It’s a strategy that requires deep pockets and patience, but Google has both. The company is betting that making frontier AI cheap and accessible will drive adoption across its ecosystem - search, cloud, workspace, and beyond.
For users, the relentless pace of improvement creates a different challenge: deciding when capabilities are good enough to build on, versus waiting for the next model that might be better. Gemini 3.1 Pro’s reasoning jump suggests the capability curve hasn’t flattened yet. The question is whether anyone can keep up.