ICML 2026 Paper: LLM Uncertainty Quantification is Just Unsupervised Clustering
Current UQ methods measure consistency, not factual truth: a category error.
A new position paper by Chen et al., accepted at ICML 2026, argues that the entire field of uncertainty quantification (UQ) for large language models suffers from a category error: mainstream UQ methods are, in effect, unsupervised clustering algorithms. The authors demonstrate that these methods measure the internal consistency of the model's generations (how similar its outputs are to one another under perturbation) rather than their external correctness relative to factual reality. As a result, current UQ is fundamentally blind to 'confident hallucinations,' in which an LLM produces stable but wrong answers with high apparent confidence. That blindness creates a deceptive sense of safety when models are deployed in high-stakes domains.
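To make the consistency-versus-correctness gap concrete, here is a minimal sketch of a consistency-based uncertainty score. It is not the paper's method: the function names, the crude string normalization (a stand-in for semantic clustering), and the sample answers are all assumptions made for illustration. The score is the entropy over clusters of sampled answers, and no step of the computation consults the true answer.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering: lowercase, drop punctuation."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()


def consistency_entropy(samples: list[str]) -> float:
    """Shannon entropy (in nats) over clusters of sampled answers.

    Low entropy means the samples agree with each other. It says nothing
    about whether they agree with reality.
    """
    clusters = Counter(normalize(s) for s in samples)
    total = sum(clusters.values())
    return sum((n / total) * math.log(total / n) for n in clusters.values())


# Illustrative samples for "What is the capital of Australia?" (correct: Canberra).
stable_wrong = ["Sydney", "sydney.", "Sydney", "Sydney!", "sydney"]
unstable = ["Canberra", "Sydney", "Melbourne", "Canberra", "Perth"]

print(consistency_entropy(stable_wrong))  # one cluster, so zero entropy
print(consistency_entropy(unstable))      # several clusters, so higher entropy
```

In this toy case the stable but wrong set of answers receives the lowest possible uncertainty, while the set that contains the correct answer is scored as more uncertain. That is the confident-hallucination failure the authors describe.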
The paper identifies three critical pathologies that follow from this dependence on internal state. First, a hyperparameter sensitivity crisis: UQ results vary wildly with small changes in sampling parameters, making deployment decisions unreliable. Second, an internal evaluation cycle: researchers validate UQ methods using the model's own internal proxies (such as semantic similarity) instead of external ground truth, conflating stability with truth. Third, a fundamental lack of ground truth in UQ evaluation, which forces reliance on unstable proxy metrics. To resolve this impasse, the authors advocate a paradigm shift: adopt better evaluation metrics with labeled ground truth, implement native uncertainty mechanisms (e.g., by modifying model architecture or training), and anchor verification in objective truth. The roadmap explicitly calls for the community to stop treating consistency as confidence.
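As one possible reading of "evaluation metrics with labeled ground truth," the sketch below scores an uncertainty method by how well it separates wrong answers from correct ones, using AUROC computed against external labels. This is an assumption about what such an evaluation could look like, not a procedure taken from the paper; the function name, the scores, and the labels are hypothetical.

```python
def auroc(scores: list[float], is_wrong: list[bool]) -> float:
    """Probability that a wrong answer gets a higher uncertainty score
    than a correct one, with ties counted as half a win."""
    wrong = [s for s, w in zip(scores, is_wrong) if w]
    right = [s for s, w in zip(scores, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


# Hypothetical uncertainty scores for five answers, with externally
# labeled correctness. The fourth answer is a confident hallucination:
# wrong, yet scored with zero uncertainty.
uncertainty = [0.0, 0.1, 1.3, 0.0, 0.9]
is_wrong = [False, False, True, True, False]

print(round(auroc(uncertainty, is_wrong), 3))
```

An AUROC of 0.5 corresponds to chance and 1.0 to perfect separation. Because the labels come from outside the model, a confident hallucination lowers the score here instead of passing unnoticed, which is what a consistency-only proxy cannot guarantee.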
- UQ methods measure internal consistency (in effect, unsupervised clustering), not external correctness, so they miss confident hallucinations.
- Three identified pathologies: hyperparameter sensitivity, internal evaluation cycle, and lack of ground truth in UQ metrics.
- The paper calls for a paradigm shift: evaluation anchored in labeled ground truth, native uncertainty mechanisms, and an end to treating consistency as confidence.
Why It Matters
For high-stakes AI, current confidence metrics may be dangerously misleading; real safety requires grounding in truth.