Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
A new method trains lightweight probes on an LLM's hidden states to detect hallucinations from within, adding as little as 1.55 ms of latency per sample.
A team of researchers has developed a novel method to tackle AI hallucination by moving detection from external systems into the model's own internal circuitry. Their paper, "Weakly Supervised Distillation of Hallucination Signals into Transformer Representations," proposes training lightweight "probes" on a model's hidden states. These probes learn to flag when the model is generating unsupported information, a detection task that has traditionally required gold-standard answers, retrieval systems, or separate judge models at runtime. The team created a 15,000-sample dataset by using LLaMA-2-7B to generate answers to SQuAD v2 questions and then applying a trio of weak supervision signals (substring matching, sentence similarity, and an LLM judge) to label each output as grounded or hallucinated.
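To make the labeling step concrete, here is a minimal sketch of how the three weak signals could be combined into a single grounded/hallucinated label. The embedding model, the 0.7 similarity threshold, and the majority-vote aggregation are illustrative assumptions rather than details from the paper, and the LLM-judge verdict is passed in as an input rather than queried here.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model for the sentence-similarity signal (not from the paper).
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def substring_signal(answer: str, reference: str) -> int:
    """1 if the gold reference answer appears verbatim inside the generated answer."""
    return int(reference.strip().lower() in answer.lower())


def similarity_signal(answer: str, reference: str, threshold: float = 0.7) -> int:
    """1 if sentence-embedding cosine similarity clears an (assumed) threshold."""
    emb = embedder.encode([answer, reference], convert_to_tensor=True)
    return int(util.cos_sim(emb[0], emb[1]).item() >= threshold)


def weak_label(answer: str, reference: str, judge_vote: int) -> int:
    """Majority vote over the three weak signals: 1 = grounded, 0 = hallucinated.

    judge_vote is the 0/1 verdict from a separate LLM judge; how that judge is
    prompted and queried is left out of this sketch.
    """
    votes = [
        substring_signal(answer, reference),
        similarity_signal(answer, reference),
        judge_vote,
    ]
    return int(sum(votes) >= 2)
```

Under this voting scheme, an output needs at least two of the three signals to agree before it is labeled grounded.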
They then trained five different probing architectures on the paired hidden states and labels. The key result is that transformer-based probes, particularly their CrossLayerTransformer (M2) and HierarchicalTransformer (M3) models, successfully learned the detection task. This supports the central hypothesis: signals for hallucination can be distilled into the model's representations during training. At inference, detection then happens from the model's internal activations alone, eliminating the need for costly external verification. The practical overhead is minimal, with probe latency ranging from 1.55 to 6.66 milliseconds per sample and end-to-end throughput remaining at approximately 0.231 queries per second. This represents a shift towards more self-aware, efficient, and trustworthy language models.
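As a rough illustration of the probing setup, the PyTorch sketch below trains a small transformer that attends across one hidden state per layer of the base model. The dimensions, pooling choice, and hyperparameters are assumptions for illustration; the actual M2 and M3 architectures are defined in the paper.

```python
import torch
import torch.nn as nn

HIDDEN = 4096    # LLaMA-2-7B hidden size
N_LAYERS = 33    # embedding layer + 32 transformer blocks


class CrossLayerProbe(nn.Module):
    """Treats one hidden state per layer as a token and attends across layers."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, depth: int = 2):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, d_model)  # compress each layer's state
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.head = nn.Linear(d_model, 2)       # grounded vs. hallucinated

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, N_LAYERS, HIDDEN), e.g. the last-token state per layer
        x = self.proj(layer_states)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))          # pool over layers, then classify


# One training step, pairing cached hidden states with the weak labels from above.
probe = CrossLayerProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

states = torch.randn(8, N_LAYERS, HIDDEN)  # stand-in for cached LLaMA-2-7B states
labels = torch.randint(0, 2, (8,))         # 1 = grounded, 0 = hallucinated

optimizer.zero_grad()
loss = criterion(probe(states), labels)
loss.backward()
optimizer.step()
```

Because the probe reads only cached hidden states, it can be trained offline and then attached to the frozen base model at inference time.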
- Uses weak supervision from 3 sources (substring match, embedding similarity, LLM judge) to create a 15k-sample dataset from LLaMA-2-7B outputs.
- Trains 5 probing classifiers directly on transformer hidden states, with the best (M2 & M3) achieving detection without external tools at inference.
- Adds only 1.55-6.66 ms of probe latency per sample, enabling internal fact-checking while end-to-end throughput stays at ~0.231 queries/second (a rough timing sketch follows this list).
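The sketch below shows what inference-time use might look like under assumed details: run the frozen LLaMA-2-7B once over the question and its generated answer, take the last-token hidden state from every layer, and feed the stack to the probe from the previous sketch, timing only the probe's forward pass.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting the LLaMA-2 license on Hugging Face.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Question plus a (deliberately wrong, hand-written) answer for illustration.
text = "Q: Who wrote Hamlet?\nA: Hamlet was written by Charles Dickens."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, 4096) tensor per layer (embeddings + 32 blocks).
    # Keep the last token's state from each layer -> (1, 33, 4096), matching the probe input.
    layer_states = torch.stack([h[0, -1] for h in out.hidden_states]).unsqueeze(0)

    start = time.perf_counter()
    logits = probe(layer_states)  # the trained CrossLayerProbe from the previous sketch
    probe_ms = (time.perf_counter() - start) * 1000

print(f"grounded={logits.argmax(dim=-1).item() == 1}, probe latency ~ {probe_ms:.2f} ms")
```

Timing only the probe isolates the added detection cost; generation by the base model itself is what dominates the ~0.231 queries/second end-to-end figure.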
Why It Matters
Enables AI models to self-detect inaccuracies in real time, making them more reliable and efficient without adding complex external verification systems.