How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II
NVIDIA's open-source AI-Q agent scored 55.95 on DeepResearch Bench I and 54.50 on DeepResearch Bench II, topping both major deep research benchmarks and beating specialized models.
NVIDIA's AI-Q, an open blueprint for building enterprise research agents, has taken the top spot on both primary benchmarks for evaluating deep research AI. Scoring 55.95 on DeepResearch Bench I and 54.50 on DeepResearch Bench II, the system demonstrates that a single, configurable stack can lead in both polished report generation and granular factual accuracy. The win is significant because the two benchmarks test complementary skills: Bench I evaluates comprehensive, well-structured narratives, while Bench II uses over 70 binary rubrics to assess information recall, analysis, and presentation.
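To make the idea of binary rubrics concrete, here is a minimal Python sketch of how pass/fail judgments could be rolled up into a percentage score. The article does not describe how DeepResearch Bench II actually aggregates its rubrics, so the equal weighting, the dimension labels, the `rubric_scores` helper, and the sample judgments below are all assumptions for illustration, not benchmark data.

```python
from collections import defaultdict


def rubric_scores(results):
    """Aggregate binary rubric judgments into percentage scores.

    results: list of (dimension, passed) pairs, one per rubric.
    Assumption: every rubric counts equally; the real benchmark may weight them.
    """
    if not results:
        raise ValueError("at least one rubric judgment is required")
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in results:
        total[dimension] += 1
        passed[dimension] += int(ok)
    per_dimension = {d: 100.0 * passed[d] / total[d] for d in total}
    overall = 100.0 * sum(passed.values()) / sum(total.values())
    return per_dimension, overall


# Hypothetical judgments for a single report (made-up, not benchmark results).
example = [
    ("recall", True), ("recall", False),
    ("analysis", True), ("analysis", True),
    ("presentation", False),
]
per_dimension, overall = rubric_scores(example)
```

Under this assumed scheme, a report is rewarded only for rubrics it demonstrably satisfies, which is why a rubric-based benchmark probes granular accuracy rather than overall polish.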
At its core, AI-Q is a modular, multi-agent architecture built for portability and inspection. It uses NVIDIA's NeMo Agent Toolkit and fine-tuned Nemotron 3 Super models, trained on roughly 67,000 supervised fine-tuning trajectories, to power a pipeline of specialized agents. The workflow features an orchestrator that manages the research loop, a planner that creates evidence-grounded research plans, and a researcher that dispatches parallel specialists to gather and synthesize information. An optional ensemble layer runs multiple agents in parallel and merges their outputs for maximum report quality and coverage.
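The orchestrator, planner, and researcher roles described above can be sketched in a few lines of plain Python. This is an illustrative outline only: it does not use the NeMo Agent Toolkit or any AI-Q API, and every name in it (`Finding`, `plan`, `specialist`, `research`, `synthesize`, `orchestrate`) is hypothetical. The specialists are stubs that echo their input instead of calling a model or a search tool.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    claim: str
    source: str  # citation backing the claim


def plan(question: str) -> list[str]:
    """Planner stub: split the question into sub-topics (hypothetical logic)."""
    return [f"{question}: background", f"{question}: recent results"]


async def specialist(subtopic: str) -> Finding:
    """Specialist stub: a real one would query search tools and a model."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return Finding(claim=f"summary of {subtopic}", source=f"placeholder://{subtopic}")


async def research(subtopics: list[str]) -> list[Finding]:
    """Researcher: run all specialists concurrently; gather keeps input order."""
    return list(await asyncio.gather(*(specialist(s) for s in subtopics)))


def synthesize(question: str, findings: list[Finding]) -> str:
    """Merge findings into a report with numbered citations."""
    lines = [f"Report: {question}"]
    for i, f in enumerate(findings, start=1):
        lines.append(f"- {f.claim} [{i}]")
    lines.append("Sources:")
    for i, f in enumerate(findings, start=1):
        lines.append(f"[{i}] {f.source}")
    return "\n".join(lines)


async def orchestrate(question: str) -> str:
    """Orchestrator: plan, gather in parallel, then synthesize."""
    subtopics = plan(question)
    findings = await research(subtopics)
    return synthesize(question, findings)


report = asyncio.run(orchestrate("solid-state batteries"))
```

Keeping each role behind its own function mirrors the modularity the blueprint emphasizes: any stage could be swapped for a different model or tool without touching the others.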
The system's success hinges on four key ingredients: its evidence-grounded, multi-agent design; the fine-tuned Nemotron 3 Super model; custom middleware for long-horizon reliability; and integration with tools like Tavily for web search and Serper for academic papers. This combination allows the agent to perform multi-step research (plan → gather → synthesize) and deliver citation-backed reports. By providing this as an open blueprint, NVIDIA enables enterprises to own, customize, and deploy state-of-the-art research capabilities without being tied to a closed API, marking a step toward more accessible and portable agentic AI.
- Achieved #1 on both DeepResearch Bench I (55.95) and II (54.50), leading in both narrative quality and factual rigor.
- Built on an open, multi-agent architecture using fine-tuned NVIDIA Nemotron 3 Super models and the NeMo Agent Toolkit.
- Enterprises can fully own, inspect, and customize the configurable stack for evidence-based research with proper citations.
Why It Matters
AI-Q gives enterprises a state-of-the-art, open-source alternative to closed AI research agents, enabling customizable and auditable deep research workflows.