ScientistOne eliminates research hallucinations with Chain-of-Evidence framework

arXiv cs.MA May 27, 2026

⚡ ScientistOne’s Chain-of-Evidence framework claims a perfect citation record, with zero hallucinated references out of 337. If that result holds, it could redefine how autonomous agents contribute to knowledge production.

Deep Dive

Scientists from Google (including Rui Meng, Bhavana Dalvi Mishra, and others) have unveiled ScientistOne, an end-to-end autonomous research agent that tackles the verifiability crisis in AI-generated papers. Existing agents produce convincing manuscripts but suffer from fabricated citations, unreproducible scores, and methods that do not match the accompanying code. To fix this, the team introduced Chain-of-Evidence (CoE), a framework requiring every claim to be traceable to its evidence source, and CoE Audit, a post-hoc verification system with four integrity checks: score verification, specification violation, reference verification, and method-code alignment. In an audit of 75 papers from five baseline systems, 21% of references were hallucinated, only 42% of reported scores passed verification, and method-code alignment ranged from 20% to 80%.
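The summary does not say how CoE Audit is implemented. As a rough illustration only, the four checks could be modeled as independent pass/fail tests over a paper record. Every name below (`Paper`, `Reference`, `audit`, the field names, and the tolerance) is hypothetical and not taken from ScientistOne.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    title: str
    resolved: bool  # True if the citation was matched to a real publication


@dataclass
class Paper:
    reported_score: float    # score stated in the manuscript
    reproduced_score: float  # score obtained by re-running the code
    spec_violations: list[str] = field(default_factory=list)
    references: list[Reference] = field(default_factory=list)
    method_steps: set[str] = field(default_factory=set)  # steps described in the text
    code_steps: set[str] = field(default_factory=set)    # steps found in the code


def audit(paper: Paper, tolerance: float = 1e-3) -> dict[str, bool]:
    """Run four hypothetical integrity checks; True means the check passed."""
    return {
        "score_verification": abs(paper.reported_score - paper.reproduced_score) <= tolerance,
        "specification_violation": not paper.spec_violations,
        "reference_verification": all(r.resolved for r in paper.references),
        # Every step described in the paper must appear in the code.
        "method_code_alignment": paper.method_steps <= paper.code_steps,
    }


# Made-up example record: one of the two references cannot be resolved.
paper = Paper(
    reported_score=0.912,
    reproduced_score=0.912,
    references=[Reference("Example A", True), Reference("Example B", False)],
    method_steps={"augment", "train"},
    code_steps={"augment", "train", "evaluate"},
)
result = audit(paper)
print(result)
```

In this toy record only the reference check fails, because one citation could not be resolved. A real audit would need retrieval against bibliographic databases and actual re-execution of experiments, neither of which is modeled here.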

ScientistOne outperformed all baselines on these checks: zero hallucinated references (0/337), perfect score verification (12/12), and 14/15 method-code alignment, the highest recorded. It matched or exceeded human expert performance on all five frontier research tasks (e.g., literature review, solution discovery, paper writing). Beyond those tasks, ScientistOne generalized to six additional domains, including medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines failed entirely. The system maintains evidence chains by construction throughout its pipeline, reportedly making it the first autonomous researcher capable of producing verifiable scientific outputs.
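For reference, the counts quoted above translate into rates as follows. This is only arithmetic on the reported figures; the dictionary keys are labels chosen here for illustration.

```python
# Counts as reported in the summary: (numerator, denominator).
reported = {
    "hallucinated_references": (0, 337),
    "score_verification_passed": (12, 12),
    "method_code_aligned": (14, 15),
}

# Convert each count pair into a fraction between 0 and 1.
rates = {name: num / den for name, (num, den) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The 14/15 alignment figure works out to roughly 93.3%. With denominators of 12 and 15, a single additional failure would move those two rates by several percentage points.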

Key Points

ScientistOne’s CoE framework produced zero hallucinated references (0/337), whereas 21% of references were hallucinated across 75 papers from five baseline systems.
The framework’s key innovation is coupling evidence verification with autonomous experimentation, not just retrieval and claim extraction.
Generalizability and computational cost remain open questions; the method’s true value will be tested on diverse, real-world research tasks.

Why It Matters

Zero-hallucination reference generation could unlock autonomous scientific discovery, but only if it scales beyond curated benchmarks.
