Google's Gemini 3 Deep Think Solves 18 Unsolved Research Problems and Disproves a Decade-Old Conjecture

Google's upgraded reasoning model finds flaws in peer-reviewed papers, optimizes semiconductor fabrication, and outperforms every frontier model on scientific benchmarks.

Google released a major upgrade to Gemini 3 Deep Think on February 12, setting new records on scientific benchmarks and, more unusually, producing results that have already made it into peer-reviewed research papers.

The model didn’t just score well on tests. It disproved a mathematical conjecture that had stood since 2015, identified errors in published research papers that human reviewers missed, and helped optimize semiconductor fabrication at Duke University. Roughly half of the 18 research results it contributed to are bound for strong academic conferences, including a paper already accepted at ICLR 2026.

This isn’t the usual “new model beats old model on benchmarks” story. It’s closer to an AI system doing actual scientific work.

The Numbers

Deep Think’s benchmark scores read like someone trying to show off:

  • 48.4% on Humanity’s Last Exam (without external tools) - a test designed to be near-impossible for current AI
  • 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation, measuring the ability to learn new skills and generalize
  • 3,455 Elo on Codeforces, placing it in the Legendary Grandmaster tier
  • 50.5% on the CMT-Benchmark for advanced theoretical physics
  • Gold-medal-level results on the written sections of the 2025 International Physics Olympiad and International Chemistry Olympiad
  • 90% on the IMO-ProofBench Advanced test

According to reporting from Winbuzzer, the model outperforms both Anthropic’s Claude Opus 4.6 and OpenAI’s GPT-5.2 across these scientific reasoning tasks.

But benchmarks are benchmarks. What’s more interesting is what happened when researchers actually used it.

Actual Research, Not Just Tests

Disproving a decade-old conjecture. Deep Think constructed a specific three-item combinatorial counterexample to disprove a conjecture from 2015 about online submodular optimization. That’s not pattern matching - that’s constructing a proof that a widely-held mathematical assumption was wrong.
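For readers unfamiliar with the term: submodularity is a diminishing-returns property of set functions, where adding an element to a small set helps at least as much as adding it to a larger one. The sketch below only illustrates what that property means on a tiny ground set - it is not the model's actual counterexample, which isn't reproduced here:

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of a collection, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def is_submodular(f, ground):
    """Check f(A | {x}) - f(A) >= f(B | {x}) - f(B) for all A ⊆ B, x ∉ B."""
    for a in map(frozenset, powerset(ground)):
        for b in map(frozenset, powerset(ground)):
            if not a <= b:
                continue
            for x in ground - b:
                if f(a | {x}) - f(a) < f(b | {x}) - f(b):
                    return False  # diminishing returns violated
    return True

# Coverage functions are a classic submodular example: each new set
# covers fewer *new* elements as the union grows.
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
coverage = lambda s: len(set().union(*[sets[i] for i in s]))

print(is_submodular(coverage, frozenset(sets)))        # True
print(is_submodular(lambda s: len(s) ** 2, frozenset({1, 2})))  # False (supermodular)
```

Disproving a conjecture in this area amounts to exhibiting one concrete function and input sequence where the conjectured guarantee fails - which is why a "three-item counterexample" is a complete refutation, not a heuristic.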

Finding what peer reviewers missed. Lisa Carbone, a mathematician at Rutgers University, used Deep Think to analyze a technical paper exploring the intersection of Einstein’s theory of gravity and quantum mechanics. The model identified a subtle logical flaw that had passed through human peer review unnoticed.

Optimizing semiconductor fabrication. Researchers at Duke University’s Wang Lab used it to design recipes for growing thin-film semiconductors larger than 100 micrometers - a target they’d been struggling to reach.

Autonomous mathematical research. Deep Think’s associated agent, Aletheia, generated a complete research paper on eigenweights in arithmetic geometry without human intervention. It also analyzed 700 open problems from the Erdős Conjecture Database, autonomously solving four questions, including Erdős-1051, which led to a generalization published in a peer-reviewed paper.

Physics applications. The model calculated gravitational radiation from cosmic strings using Gegenbauer polynomials, resolving singularities into a closed-form, finite sum. It also solved Max-Cut and Steiner Tree network optimization problems by applying continuous mathematics tools to discrete algorithmic challenges.
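For context on the Max-Cut mention: the problem asks for a two-coloring of a graph's nodes that maximizes the number of edges crossing between the colors. The brute-force sketch below shows the discrete problem itself; it is not Deep Think's approach, which reportedly used continuous-mathematics tools rather than exhaustive search:

```python
from itertools import combinations

def max_cut(nodes, edges):
    """Exhaustively search all 2-colorings; return (cut_value, one side)."""
    nodes = list(nodes)
    best_value, best_side = 0, frozenset()
    for r in range(len(nodes) + 1):
        for side in combinations(nodes, r):
            side = frozenset(side)
            # Count edges with exactly one endpoint inside `side`.
            value = sum(1 for u, v in edges if (u in side) != (v in side))
            if value > best_value:
                best_value, best_side = value, side
    return best_value, best_side

# 4-cycle: bipartite, so the maximum cut crosses all four edges.
value, side = max_cut(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)])
print(value)  # 4
```

This enumeration is exponential in the number of nodes, which is exactly why Max-Cut (an NP-hard problem) is a meaningful testbed for new solution techniques.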

The collaboration model Google describes is what it calls the “Advisor” approach: experts guide the AI through iterative “vibe-proving” cycles, using balanced prompting techniques to prevent confirmation bias. It’s a far cry from “ask ChatGPT to write my essay.”

How It Works

Deep Think uses what Google calls “System 2” thinking - slower, deliberate multi-step reasoning rather than quick responses. The model was developed in partnership with scientists and designed specifically for problems that lack clear guardrails or single correct solutions, and where data is often messy or incomplete.

One notable addition: the model can analyze hand-drawn sketches and convert them into 3D-printable objects with functional code generation. The Aletheia agent also recognizes when mathematical problems cannot be solved with current approaches, saving researchers from chasing dead ends.

What This Means

The shift here isn’t incremental. Previous AI models were useful for literature review, data analysis, and code generation. Deep Think is doing something qualitatively different - contributing novel mathematical results, catching errors in peer-reviewed work, and solving problems that humans were stuck on.

Google is positioning this as a “force multiplier” for researchers rather than a replacement, and the emphasis on human-AI collaboration in the research process backs that up. But the Aletheia agent generating a complete paper autonomously suggests the line between “assistant” and “contributor” is getting thin.

The practical implications are significant for anyone in academic research, particularly in mathematics, physics, and materials science. Having a tool that can spot logical flaws in papers and construct counterexamples to established conjectures changes the speed at which mathematical research can move.

Availability

Deep Think is available to Google AI Ultra subscribers through the Gemini app. API access is available through an early access program for select researchers and enterprises. No pricing details were announced for the API tier.

The Bigger Picture

Google’s approach with Deep Think represents a different bet than what we’ve seen from OpenAI or Anthropic recently. Instead of competing on general-purpose chat or coding assistants, Google is building a tool specifically for the hardest kind of intellectual work - frontier research problems where the answer isn’t known.

Whether this translates into actual scientific acceleration will depend on adoption. The results so far, with papers submitted to ICLR and findings published in journals, suggest this isn’t vaporware. But the model is currently locked behind a premium subscription and limited API access, which narrows who can use it.

The more interesting question is whether other AI labs will follow Google into the “AI for science” lane, or whether scientific research becomes a differentiating feature for one provider. Given that Gemini 3 Pro already tops the LMArena leaderboard with a 1,501 Elo score, Google seems to be pulling away on the research front even as the general-purpose AI race remains crowded.