Binghamton researchers develop voting protocol to eliminate AI hallucinations in medical diagnosis
Seven LLMs vote on each answer: 76.85% of answers won support from at least four models, and none contained hallucinations.
Binghamton University researchers have developed a verification protocol that virtually eliminates AI hallucinations in medical diagnosis. The method, published in STAR Protocols, uses seven open-source large language models (LLMs) that are forced to employ retrieval-augmented generation (RAG), meaning they must consult an authoritative medical terminology database before responding. Across 10,000 experiments, each chatbot received the same plain-language symptoms and generated standardized medical terms with official identification numbers. The models then 'voted' on the correct diagnosis. Results showed that 76.85% of answers were supported by at least four LLMs, and all remaining answers had backing from at least two models. There were no unmatched terms and, critically, no hallucinations. The work was funded by a $100,000 grant from New York State's Empire AI Consortium.
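The voting step can be illustrated with a minimal Python sketch. It is not the researchers' code: the function name, the model names, and the term IDs are hypothetical, and it assumes each model returns exactly one term ID per query. The thresholds (four of seven for a majority, two for partial backing) follow the figures reported above.

```python
from collections import Counter

MAJORITY = 4  # "at least four" of seven models, as reported in the article


def tally_votes(answers):
    """Count votes for each term ID and grade the winning answer.

    `answers` maps a (hypothetical) model name to the term ID it returned.
    Ties are resolved in favor of the term seen first, an arbitrary choice
    made only for this sketch.
    """
    counts = Counter(answers.values())
    term_id, support = counts.most_common(1)[0]
    if support >= MAJORITY:
        level = "majority"
    elif support >= 2:
        level = "partial"
    else:
        level = "unmatched"  # no two models agreed on any term
    return term_id, support, level


# Placeholder answers from seven models; the IDs are made up for illustration.
example = {
    "model_a": "T001", "model_b": "T001", "model_c": "T001", "model_d": "T001",
    "model_e": "T002", "model_f": "T002", "model_g": "T003",
}
result = tally_votes(example)  # ("T001", 4, "majority")
```

Under this reading, an "unmatched" answer is one that only a single model produced, which is the case the study reports never occurred.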
The protocol's key advantage is reproducibility: because hundreds of open-source LLMs are available, researchers can randomly select seven models for each experiment, repeat the experiment many times, and build confidence as results accumulate. Beyond diagnosis, the team envisions applications in verifying adverse drug reactions reported in clinical trials, scientific literature, and pharmacological databases. The protocol could also support 'digital twins' for precision medicine: dynamic AI simulations of human physiology that help optimize treatments before real-world testing. The researchers have already started piloting multi-layer models for ER+ breast cancer. This voting-based approach marks a practical step toward trustworthy AI in healthcare, where even a single hallucination can have serious consequences.
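The random reselection described above might look like the following sketch. It assumes a hypothetical `ask(model, symptoms)` callable that returns one term ID per model; how the real protocol queries its models is not described in this article, so the stub used here stands in for that step.

```python
import random
from collections import Counter


def run_trials(pool, ask, symptoms, trials=100, panel_size=7, majority=4, seed=0):
    """Return the fraction of trials in which a majority of the panel agreed.

    `pool` is a list of model names and `ask` is a caller-supplied function;
    both are placeholders for illustration, not part of the published protocol.
    """
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    hits = 0
    for _ in range(trials):
        panel = rng.sample(pool, panel_size)  # draw seven distinct models
        votes = Counter(ask(model, symptoms) for model in panel)
        _, support = votes.most_common(1)[0]
        if support >= majority:
            hits += 1
    return hits / trials


# Stub: every model returns the same made-up term ID, so every trial agrees.
pool = [f"model_{i}" for i in range(20)]
rate = run_trials(pool, lambda model, symptoms: "T001", "chest pain", trials=10)
```

Fixing the seed makes an individual run repeatable, while varying it across runs gives the fresh random panels the article describes.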
- Protocol uses seven open-source LLMs with RAG to reference a standard medical database.
- 76.85% of answers were supported by at least four models; zero hallucinations across 10,000 experiments.
- Method is reproducible by randomly reselecting LLMs, increasing confidence with each iteration.
Why It Matters
Boosts trust in AI-assisted diagnosis by eliminating false information, enabling safer clinical decision support and precision medicine.