Media & Culture

MiroThinker H1 tops GPT-5.4, Claude 4.6 Opus on BrowseComp; its 3B-param open-source variant beats GPT-5 on GAIA

The open-source 3B-parameter variant outperforms GPT-5 on the GAIA benchmark with a novel verification architecture.

Deep Dive

MiroMind's new MiroThinker H1 has taken the lead on the BrowseComp benchmark for web browsing agents, scoring 88.2 and outperforming established giants like Gemini 3.1 Pro (85.9), Claude 4.6 Opus (84.0), and GPT-5.4 (82.7). The performance gap widens on the GAIA benchmark, where MiroThinker H1 scores 88.5 versus GPT-5's 76.4. While it doesn't dominate all benchmarks—Gemini 3 Pro leads on SUPERChem and Claude is ahead on DeepSearchQA—the results establish MiroThinker as a formidable specialist for agentic web browsing tasks.

The breakthrough isn't just in raw scores but in a novel verification architecture that dramatically improves efficiency. The system employs a 'Local Verifier' that forces the agent to explore reasoning paths more thoroughly at each step. On a challenging subset of 295 BrowseComp questions, this mechanism improved pass@1 accuracy from 32.1% to 58.5% while slashing the average number of interaction steps from 1,185.2 to just 210.8—nearly doubling accuracy using roughly one-sixth the steps. This suggests a paradigm shift for agent design: verifying more and exploring less wastefully can be more effective than simply extending reasoning chains.
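The write-up doesn't include the verifier's implementation, but the intuition can be sketched as a verification-gated agent loop: sample a few candidate next steps, let a local verifier score them in context, and only commit to a step that passes. Everything in the sketch below (the Step type, propose_candidates, verify, the 0.7 threshold) is an illustrative assumption, not MiroThinker's actual code.

```python
# Hypothetical sketch of a verification-gated agent loop. This is NOT
# MiroThinker's implementation; the callables and threshold are stand-ins
# for the idea of checking each step locally before committing to it.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    action: str       # e.g. a search query or a page click
    rationale: str    # the model's stated reason for taking it

def run_agent(
    propose_candidates: Callable[[list[Step]], list[Step]],  # samples k candidate next steps
    verify: Callable[[list[Step], Step], float],             # local verifier: scores a candidate in context
    is_done: Callable[[list[Step]], Optional[str]],          # returns an answer once the trajectory suffices
    max_steps: int = 50,
    accept_threshold: float = 0.7,                            # assumed cutoff for accepting a step
) -> Optional[str]:
    trajectory: list[Step] = []
    for _ in range(max_steps):
        answer = is_done(trajectory)
        if answer is not None:
            return answer
        candidates = propose_candidates(trajectory)
        if not candidates:
            continue
        # Keep only the candidate the local verifier is most confident in,
        # instead of committing to the first sampled continuation and
        # correcting course many steps later.
        scored = [(verify(trajectory, c), c) for c in candidates]
        best_score, best_step = max(scored, key=lambda sc: sc[0])
        if best_score >= accept_threshold:
            trajectory.append(best_step)
        # If nothing passes verification, the loop re-samples rather than
        # extending the trajectory with a step that is likely wasted.
    return None
```

The design trade-off this illustrates is the one the benchmark numbers suggest: more model calls per step (scoring candidates) in exchange for far fewer steps overall.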

Perhaps most striking is the performance of the smaller, open-source variant. MiroThinker 1.7 mini, running on just 3B activated parameters from a Qwen3 MoE architecture, scores 80.3 on GAIA, still beating GPT-5. This raises significant questions about the source of agentic performance, suggesting that architecture and training methodology—particularly this verification-centric approach—may rival raw parameter count in importance. The model weights are available on HuggingFace, inviting further community testing and development.
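For readers who want to experiment, loading an open checkpoint follows the standard Hugging Face transformers pattern. The article doesn't give a repo id, so the one below is a placeholder; substitute whatever MiroMind actually publishes, along with any agent or tool-use scaffolding its model card recommends.

```python
# Minimal sketch of loading an open-weights checkpoint with Hugging Face
# transformers. The repo id is hypothetical; look up the real MiroThinker
# model card for the released weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "miromind-ai/MiroThinker-placeholder"  # hypothetical: replace with the real repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Summarize the latest BrowseComp leaderboard changes."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```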

Key Points
  • MiroThinker H1 scores 88.2 on BrowseComp, beating GPT-5.4 (82.7) and Claude 4.6 Opus (84.0)
  • Its 3B-parameter open-source variant scores 80.3 on GAIA, outperforming GPT-5's 76.4
  • Novel 'Local Verifier' mechanism lifts pass@1 accuracy on a hard BrowseComp subset from 32.1% to 58.5% while cutting interaction steps by 82%

Why It Matters

Demonstrates that efficient agent architecture can outperform massive models, potentially lowering compute costs for complex AI tasks.