MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
Medical AI hallucinations just got a new enemy: word-level detection that outperforms state-of-the-art detectors by over 15%.
Tung Sum Thomas Kwok and colleagues have introduced MedFabric and EtHER, a data-centric framework for generating and detecting word-level fabrications in medical large language models (LLMs). The problem is critical: LLMs often produce fluent but factually incorrect statements in expert domains like medicine, a phenomenon called hallucination. Existing datasets for detecting such fabrications suffer from limited coverage, stylistic differences between human-written and AI-generated text, and distributional drift. MedFabric addresses this with a pipeline that creates realistic, subtle factual deviations while preserving syntax and style.
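To make "word-level fabrication" concrete, here is a minimal Python sketch of the general idea: change exactly one fact-bearing word and leave the rest of the sentence untouched. It is a toy illustration, not the MedFabric pipeline. The `SWAPS` table and the `fabricate_one_word` function are hypothetical names invented for this example, and the word pairs are not taken from the dataset.

```python
# Hypothetical substitution table (illustrative only, not from MedFabric):
# each key is replaced by a plausible but different term of the same kind.
SWAPS = {
    "metformin": "insulin",
    "hypertension": "hypotension",
    "daily": "weekly",
}


def fabricate_one_word(sentence, swaps=SWAPS):
    """Replace the first swappable word and keep everything else unchanged.

    Returns (new_sentence, word_index), or (sentence, None) if no word matched.
    """
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,;:")      # the word without trailing punctuation
        tail = tok[len(core):]         # the punctuation to put back afterwards
        replacement = swaps.get(core.lower())
        if replacement is not None:
            tokens[i] = replacement + tail
            return " ".join(tokens), i
    return sentence, None


print(fabricate_one_word("The patient takes metformin daily."))
# ('The patient takes insulin daily.', 3)
```

The output differs from the input by a single word, so sentence length, syntax, and style give a detector nothing to go on. That is the property the article attributes to MedFabric's fabrications, although the real pipeline presumably selects its substitutions far more carefully than a lookup table does.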
Building on this dataset, EtHER is a modular detector combining three components: Text2Table Decomposition (structuring text into tables for easier comparison), Word Masking and Filling (identifying suspect tokens), and Hybrid Sentence Pair Evaluation (assessing factual alignment). The authors report that EtHER beats state-of-the-art detectors by over 15% on word-level fabrication benchmarks and maintains its performance on structurally similar sentences. The framework offers a way to check that medical LLM output stays factually accurate, a critical step for clinical applications.
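The detection side can be pictured as comparing a candidate sentence against a trusted reference and flagging the individual words that disagree. The sketch below uses plain token alignment from Python's standard `difflib` module as a stand-in. It does not implement Text2Table Decomposition, Word Masking and Filling, or Hybrid Sentence Pair Evaluation, and the function name `flag_changed_words` is a hypothetical one chosen for this example.

```python
from difflib import SequenceMatcher


def flag_changed_words(reference, candidate):
    """Return (index, word) pairs for candidate words that have no
    counterpart at the aligned position in the reference sentence."""
    ref = reference.split()
    cand = candidate.split()
    # autojunk=False disables difflib's popularity heuristic for long inputs
    matcher = SequenceMatcher(a=ref, b=cand, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark candidate tokens absent from the reference
        if tag in ("replace", "insert"):
            flagged.extend((j, cand[j]) for j in range(j1, j2))
    return flagged


reference = "The patient takes metformin daily."
candidate = "The patient takes insulin daily."
print(flag_changed_words(reference, candidate))
# [(3, 'insulin')]
```

Exact string alignment only works when a matching reference sentence is available and worded the same way. A real detector has to cope with paraphrase and with sentences that look alike but state different facts, which is presumably why EtHER combines several components rather than relying on surface comparison.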
- MedFabric dataset generates realistic word-level medical fabrications with subtle factual deviations, improving detection training data.
- EtHER detector uses Text2Table Decomposition, Word Masking/Filling, and Hybrid Sentence Pair Evaluation for modular factuality checking.
- EtHER outperforms existing state-of-the-art detectors by over 15% on word-level fabrication benchmarks.
Why It Matters
For healthcare AI, catching subtle fabrications at the word level helps keep dangerous misinformation out of clinical decisions.