OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
A new open-source pipeline synthesizes 97,000 research trajectories entirely offline, boosting model accuracy by 34 percentage points on a deep-research benchmark.
A research team led by Zhuofeng Li has introduced OpenResearcher, a fully reproducible pipeline designed to solve a major bottleneck in AI agent development: the lack of high-quality, scalable training data for long-horizon research tasks. Current methods rely on costly, unstable proprietary web APIs. OpenResearcher removes this dependency by first bootstrapping a 15-million-document corpus, then running an entirely offline "search-and-browse" loop built on three explicit browser primitives. This enables scalable synthesis of complex research trajectories in which an agent must interleave search, evidence gathering, and multi-step reasoning.
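To make the offline "search-and-browse" loop concrete, here is a minimal sketch in Python. The primitive names (`search`, `open_page`, `find_in_page`), the toy term-overlap retriever, and the stopping rule are all illustrative assumptions, not the paper's actual API; the point is only to show how an agent can interleave retrieval and browsing over a local corpus while logging every tool call as a replayable trajectory.

```python
# Hedged sketch of an offline search-and-browse agent loop.
# All names and logic here are hypothetical stand-ins for
# OpenResearcher's real primitives and retriever.
from dataclasses import dataclass


@dataclass
class OfflineBrowser:
    """Toy stand-in for the instrumented offline environment:
    a dict mapping doc IDs to text plays the role of the corpus."""
    corpus: dict

    def search(self, query: str, k: int = 3) -> list:
        # Primitive 1: rank docs by naive term overlap (placeholder
        # for a real retriever) and return the top-k doc IDs.
        terms = query.lower().split()
        scored = sorted(
            self.corpus,
            key=lambda d: sum(t in self.corpus[d].lower() for t in terms),
            reverse=True,
        )
        return scored[:k]

    def open_page(self, doc_id: str) -> str:
        # Primitive 2: fetch a document's text by ID.
        return self.corpus.get(doc_id, "")

    def find_in_page(self, doc_id: str, needle: str) -> bool:
        # Primitive 3: check whether a page mentions a string.
        return needle.lower() in self.corpus.get(doc_id, "").lower()


def research_loop(browser, question, needle, max_calls=100):
    """Interleave search and browsing, logging each tool call so the
    trace can later be replayed as a training trajectory."""
    trajectory = []
    for doc_id in browser.search(question):
        trajectory.append(("search_hit", doc_id))
        if len(trajectory) >= max_calls:
            break
        trajectory.append(("open", browser.open_page(doc_id)[:40]))
        if browser.find_in_page(doc_id, needle):
            trajectory.append(("evidence", doc_id))
            break
    return trajectory
```

Because every call runs against a local corpus, the same loop can be replayed deterministically, which is what makes large-scale trajectory synthesis and controlled analysis tractable.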
Using GPT-OSS-120B as a teacher model, the team generated over 97,000 synthetic trajectories, many exceeding 100 tool calls, representing deep, multi-turn research sessions. Supervised fine-tuning of a 30B-A3B model backbone on this data yielded a 34.0-percentage-point improvement on the challenging BrowseComp-Plus benchmark, reaching 54.8% accuracy. The model also remained competitive on other key benchmarks such as GAIA and xbench-DeepSearch.
The fully instrumented, offline environment is a key innovation, enabling controlled analysis that yielded practical insights for pipeline design. The team studied the impact of data filtering, agent configuration, and how retrieval success correlates with final answer accuracy. Critically, the authors are releasing the complete pipeline, all 97K+ synthesized trajectories, model checkpoints, and the offline search environment, providing a foundational open-source resource for the community.
- Generates 97,000+ long-horizon research trajectories offline using a 15M-document corpus and browser primitives.
- Fine-tuned 30B-A3B model shows a +34.0 point accuracy jump on BrowseComp-Plus, reaching 54.8%.
- Fully open-sources pipeline, data, and environment to enable reproducible research and agent development.
Why It Matters
Provides a scalable, open-source foundation for training capable research AI agents, moving beyond reliance on expensive, proprietary APIs.