Research & Papers

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

A new open-source system validates every AI-generated medical claim against evidence, outperforming GPT-4 on accuracy benchmarks.

Deep Dive

A research team led by Miloš Košprdić and Nikola Milošević has introduced VerifAI, a fully open-source search engine designed specifically for biomedical question answering. Unlike standard retrieval-augmented generation (RAG) systems, VerifAI integrates a novel post-hoc verification mechanism that breaks down generated answers into individual claims and validates each one against the retrieved evidence using a fine-tuned natural language inference (NLI) model. The system's hybrid information retrieval module is optimized for biomedical literature, achieving a MAP@10 score of 42.7%. Crucially, its verification component outperformed GPT-4 on the HealthVer benchmark for detecting hallucinations, significantly reducing the rate of incorrect or unsupported citations.
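The verification step described above — splitting a generated answer into individual claims and checking each against retrieved evidence — can be sketched as follows. This is a minimal illustration, not VerifAI's implementation: the `entailment_score` function here is a token-overlap stand-in for the fine-tuned NLI model, and the sentence-level claim split and `threshold` value are assumptions.

```python
import re

# Stand-in for the fine-tuned NLI model: scores how well an evidence
# passage supports a claim. A token-overlap heuristic is used purely
# for illustration; the real system runs a trained entailment
# classifier over (claim, evidence) pairs.
def entailment_score(claim: str, evidence: str) -> float:
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    evidence_tokens = set(re.findall(r"\w+", evidence.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)

def verify_answer(answer: str, evidence_passages: list[str],
                  threshold: float = 0.5) -> list[dict]:
    """Decompose an answer into sentence-level claims and validate each
    one against the retrieved evidence (post-hoc verification)."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer)
              if s.strip()]
    results = []
    for claim in claims:
        # A claim is supported if at least one passage entails it.
        best = max(entailment_score(claim, p) for p in evidence_passages)
        results.append({
            "claim": claim,
            "score": best,
            "label": "SUPPORTED" if best >= threshold else "UNSUPPORTED",
        })
    return results
```

Keeping the output per-claim rather than per-answer is what makes the result traceable: each flagged claim can be shown to the user alongside the passage that did (or did not) support it.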

VerifAI's architecture is modular, consisting of three core components: the retrieval module, a citation-aware generative model fine-tuned on a custom dataset, and the verification engine. The team has open-sourced the entire pipeline—including code, models, and datasets—to promote reliable AI deployment in critical domains like healthcare. This transparency allows every claim in an answer to be traced back to its source, providing a verifiable lineage that is essential for building trust. The system represents a major step toward accountable AI by not just generating answers but actively checking their factual consistency against retrieved evidence, a necessity for applications where errors can have serious consequences.
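The hybrid retrieval module combines a lexical ranking and a semantic (embedding-based) ranking over the same corpus. One common way to merge such rankings is reciprocal rank fusion (RRF), sketched below; whether VerifAI uses this exact fusion formula is an assumption for illustration, and the constant `k = 60` is the conventional default rather than a documented system parameter.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.
    Each document scores sum(1 / (k + rank)) over every list in which
    it appears, so documents ranked well by multiple retrievers rise."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a lexical ranking `["d1", "d2", "d3"]` with a semantic ranking `["d2", "d3", "d1"]` promotes `d2`, which both retrievers place near the top. Rank-based fusion like this avoids having to normalize the incompatible raw scores that lexical and embedding search produce.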

Key Points
  • VerifAI's verification component outperforms GPT-4 on the HealthVer benchmark for detecting AI hallucinations.
  • A fine-tuned NLI model validates each decomposed claim against the retrieved evidence, flagging unsupported statements.
  • The entire pipeline is open-source, including code and models, to enable reliable deployment in high-stakes biomedical applications.

Why It Matters

VerifAI provides a transparent, verifiable standard for AI in medicine, where factual errors are unacceptable.