Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
New study finds AI hallucination detectors fail when switching between legal, financial, and science domains.
A new research paper by Snehit and Pujith Vaddi challenges the idea of a universal neural signature for AI hallucinations. The study builds on prior work identifying sparse 'hallucination neurons' (H-neurons)—less than 0.1% of feed-forward network neurons—that predict when large language models (LLMs) will fabricate information. The key question was whether these neurons generalize across different types of knowledge.
To test this, the researchers implemented a systematic cross-domain transfer protocol. They evaluated five open-weight models, ranging from 3B to 8B parameters, across six distinct domains: general question-answering, legal, financial, science, moral reasoning, and code vulnerability. The results were clear: classifiers trained to detect hallucinations from H-neuron activations in one domain failed to maintain performance when applied to another. Within-domain, the classifiers achieved an AUROC of 0.783, but this plummeted to 0.563 under cross-domain transfer, a statistically significant drop of 0.220 (p < 0.001) that was consistent across all models tested.
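The protocol can be pictured as training a lightweight probe on H-neuron activations from one domain and scoring it on every other domain. The sketch below is a minimal illustration of that idea, not the authors' code: the data are synthetic, and the logistic-regression probe and domain names are assumptions. Each synthetic domain is given its own "hallucination direction" so that within-domain AUROC stays high while cross-domain AUROC collapses, mirroring the reported pattern.

```python
# Minimal sketch of a cross-domain transfer evaluation (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
N_NEURONS, N_EXAMPLES = 64, 500

def make_domain():
    """Return (train, test) splits of synthetic H-neuron activations and labels."""
    w = rng.normal(size=N_NEURONS)  # domain-specific "hallucination direction" (assumption)

    def sample():
        X = rng.normal(size=(N_EXAMPLES, N_NEURONS))  # candidate H-neuron activations
        y = (X @ w + rng.normal(scale=1.0, size=N_EXAMPLES) > 0).astype(int)  # 1 = hallucinated
        return X, y

    return sample(), sample()

domains = ["general_qa", "legal", "financial", "science", "moral", "code_vuln"]
splits = {name: make_domain() for name in domains}

for train_dom in domains:
    (X_tr, y_tr), _ = splits[train_dom]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for test_dom in domains:
        _, (X_te, y_te) = splits[test_dom]
        auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
        kind = "within" if train_dom == test_dom else "cross"
        print(f"{kind}: {train_dom} -> {test_dom}  AUROC = {auroc:.3f}")
```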
This finding has major technical implications. It suggests that hallucination is not a single, monolithic mechanism with a universal neural footprint. Instead, the neural populations involved in generating incorrect information differ depending on the specific knowledge domain being queried. The brain of an LLM appears to have specialized 'circuits' for making things up about law versus science versus code.
Consequently, the research directly impacts how developers might build and deploy practical hallucination mitigation tools. The study concludes that neuron-level hallucination detectors cannot be trained once and applied universally. For effective deployment in real-world applications—like legal document review or financial analysis—these detectors must be carefully calibrated for each specific knowledge domain, adding a layer of complexity to AI safety efforts.
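In practice, that calibration requirement translates into maintaining one detector per domain and routing each request to the matching probe. The fragment below is a hypothetical deployment sketch, not anything described in the paper; the DOMAIN_PROBES mapping and route_and_score function are invented names, and the per-domain probes would be trained offline, for example as in the earlier sketch.

```python
# Hypothetical per-domain routing for neuron-level hallucination detectors.
from typing import Dict
import numpy as np
from sklearn.linear_model import LogisticRegression

DOMAIN_PROBES: Dict[str, LogisticRegression] = {}  # filled offline, one calibrated probe per domain

def route_and_score(h_neuron_activations: np.ndarray, domain: str) -> float:
    """Score hallucination risk using the probe calibrated for the request's domain."""
    if domain not in DOMAIN_PROBES:
        raise KeyError(f"No calibrated detector for domain '{domain}'")
    probe = DOMAIN_PROBES[domain]
    return float(probe.predict_proba(h_neuron_activations.reshape(1, -1))[0, 1])
```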
Key Takeaways
- Hallucination neuron classifiers suffer a 0.220 AUROC drop (from 0.783 to 0.563) when transferred between domains like legal and science.
- Tested on 5 open-weight LLMs (3B to 8B parameters) across 6 knowledge domains, showing the effect is model-agnostic.
- Indicates hallucination is not a single universal mechanism, meaning neuron-level detectors require domain-specific calibration.
Why It Matters
Forces a rethink of one-size-fits-all AI safety tools, requiring specialized detectors for high-stakes domains like law and finance.