MiroThinker 72B: The Open-Source Research Agent That Outperforms GPT-5

An open-source AI agent using interactive scaling beats OpenAI's GPT-5-high on Humanity's Last Exam. Here's what makes it different.

A few months ago, matching GPT-5 on reasoning benchmarks required an API subscription to a frontier lab. That changed this week when MiroMind AI released MiroThinker 72B, an open-source research agent that beats OpenAI’s GPT-5-high on Humanity’s Last Exam and achieves 81.9% on GAIA—putting it in frontier territory for complex reasoning tasks.

The catch? There isn’t one. The model weights are on HuggingFace, the code is on GitHub, and you can run it locally if you have the hardware.

The Numbers

MiroThinker 72B posts strong results across multiple benchmarks:

| Benchmark | MiroThinker 72B | GPT-5-high |
| --- | --- | --- |
| GAIA-Val-165 | 81.9% | ~82% |
| Humanity's Last Exam (HLE) | 37.7% | 35.2% |
| BrowseComp | 47.1% | — |
| BrowseComp-ZH | 55.6% | — |

The GAIA benchmark tests general AI assistant capabilities requiring reasoning, multi-modality, web browsing, and tool use. Humanity’s Last Exam (HLE) is designed to be extremely difficult—questions contributed by experts across fields to challenge AI systems. MiroThinker’s 37.7% beats GPT-5-high’s 35.2% on the text-only subset.

What Makes It Different

MiroThinker doesn’t just scale model size. The team at MiroMind AI introduced what they call “interactive scaling”—a third dimension of performance improvement alongside parameters and context length.

The concept: train the model to handle deeper and more frequent interactions with its environment. Rather than generating answers directly, MiroThinker runs verification cycles, calls tools to gather information, and refines its reasoning based on feedback. The arXiv paper shows performance gains of 8-10 points as interaction depth increases.
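The loop described above can be sketched in a few lines. This is a toy illustration of the act–verify–refine pattern, not MiroThinker's actual code; `call_tool` and `verify` are hypothetical stand-ins for its real tool and verification machinery.

```python
# Minimal sketch of an interactive-scaling loop: instead of answering in
# one shot, the agent alternates between acting (tool calls) and verifying,
# and only commits once a check passes or the interaction budget runs out.

def call_tool(query: str) -> str:
    """Hypothetical tool call (e.g. web search); returns an observation."""
    return f"observation for: {query}"

def verify(answer: str, observations: list[str]) -> bool:
    """Hypothetical verification: is the draft supported by enough evidence?"""
    return len(observations) >= 3  # toy criterion

def research(question: str, max_calls: int = 600) -> str:
    observations: list[str] = []
    answer = ""
    for step in range(max_calls):
        # Act: gather one more piece of evidence from the environment.
        observations.append(call_tool(f"{question} (step {step})"))
        # Draft an answer from everything seen so far.
        answer = f"draft based on {len(observations)} observations"
        # Verify before committing; deeper interaction means more refinement.
        if verify(answer, observations):
            break
    return answer
```

The point of interactive scaling is that raising `max_calls` (up to 600 in MiroThinker's case) buys accuracy without adding parameters.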

The technical setup supports this approach:

  • 256K context window for extended reasoning chains
  • Up to 600 tool calls per task for deep research
  • Recency-based context retention that preserves recent observations while managing memory
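The recency-based retention in the last bullet amounts to a trimming policy: when history outgrows the budget, drop the oldest observations first. The sketch below is illustrative (character budgets instead of tokens, a made-up message layout), not MiroThinker's internals.

```python
# Sketch of recency-based context retention: keep the task prompt and the
# most recent observations, dropping the oldest ones once over budget.
# Budgets are in characters for simplicity; a real agent counts tokens.

def retain_recent(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first (task) message plus as many recent ones as fit."""
    task, rest = messages[0], messages[1:]
    kept: list[dict] = []
    used = len(task["content"])
    # Walk history newest-first, stopping when the budget is exhausted.
    for msg in reversed(rest):
        if used + len(msg["content"]) > budget:
            break
        kept.append(msg)
        used += len(msg["content"])
    return [task] + list(reversed(kept))

history = [{"role": "user", "content": "task"}] + [
    {"role": "tool", "content": f"obs{i:02d}xxxx"} for i in range(10)
]
trimmed = retain_recent(history, budget=40)  # task + the 4 newest observations
```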

The training pipeline combines supervised fine-tuning, direct preference optimization (DPO), and group relative policy optimization (GRPO) reinforcement learning. The result is a model that learns when to explore versus when to commit to an answer.
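For the DPO stage, the standard preference loss (from the original DPO formulation, not anything MiroThinker-specific) fits in a few lines; the log-probabilities below are made-up numbers purely for illustration.

```python
import math

# Standard DPO objective: push the policy to prefer the chosen response y_w
# over the rejected y_l, relative to a frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))"""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does:
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
# Policy prefers the rejected answer instead: the loss is higher.
bad = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-8.0, ref_logp_l=-6.0)
```

GRPO then adds reinforcement learning on top, rewarding whole interaction trajectories rather than single preference pairs.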

Why This Matters

Open-source models have been narrowing the gap with commercial alternatives for a while now. What makes MiroThinker notable is where it’s competitive: complex reasoning and research tasks that require sustained multi-step analysis.

Previous open models excelled at simpler benchmarks while struggling on tasks requiring extended chains of thought. MiroThinker’s interactive scaling approach suggests a path forward that doesn’t require ever-larger parameter counts—it requires smarter interaction patterns.

For researchers and developers, this means frontier-level research capabilities without API costs or usage restrictions. For organizations concerned about data privacy, it means keeping sensitive research queries on local infrastructure.

Running It Yourself

MiroThinker comes in multiple sizes:

  • MiroThinker-1.7-mini (30B) — Fits on consumer GPUs with quantization
  • MiroThinker-72B — Requires ~140GB VRAM at full precision, or ~40GB quantized
  • MiroThinker-235B — Datacenter scale

The 72B version is the sweet spot for users with capable hardware: quantized to 4-bit, it should run on dual RTX 3090s, or on a single RTX 4090 with more aggressive quantization and offloading.
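The VRAM figures above follow from simple arithmetic on weight storage alone (ignoring KV cache, activations, and framework overhead, so real usage runs higher):

```python
# Back-of-the-envelope VRAM for model weights only.

def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

fp16 = weight_vram_gb(72, 16)  # 144 GB, matching the ~140GB figure
int4 = weight_vram_gb(72, 4)   # 36 GB, matching the ~40GB quantized figure
```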

The GitHub repository includes tool integrations for web search, code execution, and file operations. You’ll need to configure API keys for any external tools you want the agent to use.

What This Means

The frontier model moat is leaking. MiroThinker demonstrates that sophisticated training approaches can compensate for smaller parameter counts, and that open-source projects can deliver research-grade reasoning capabilities.

This doesn’t mean GPT-5 is obsolete—OpenAI’s model still leads on many benchmarks and offers convenience that local deployment can’t match. But for research tasks where you need transparency, reproducibility, or privacy, MiroThinker is now a credible alternative.

The broader trend: AI reasoning capabilities are becoming commoditized. Models that seemed impossibly advanced a year ago now run on prosumer hardware. If you’re building applications that require complex reasoning, the cost of that capability just dropped significantly.

What You Can Do

If you have capable hardware:

  1. Download from HuggingFace
  2. Start with the 30B version to test the workflow
  3. Configure tool integrations for your research needs

If you don’t have the hardware:

  • Wait for hosted inference options (several providers are already setting up MiroThinker endpoints)
  • Try the online demo to evaluate capabilities

For security-sensitive research:

  • MiroThinker’s tool-calling capabilities introduce the same risks as any AI agent with system access
  • Sandbox thoroughly before connecting to production systems
  • The model has no built-in safeguards against harmful queries—that’s your responsibility

The message is clear: the era of open-source frontier AI agents has arrived. Whether that’s exciting or concerning depends on your perspective. Either way, it’s happening.