Research & Papers

New probe-targeted LoRA fine-tuning makes LLMs reveal true confidence

LLMs know when they're wrong but won't say so—until now.

Deep Dive

Probe-targeted LoRA fine-tuning can teach an instruct-tuned LLM to verbalize its internal confidence. Hidden states already discriminate correct from incorrect answers (AUROC 0.76–0.88), yet when asked directly the model defaults to reporting 99% confidence. The method uses probe outputs as fine-tuning targets, needs only a few hundred examples and under 10 minutes on an M3 Ultra, and was tested across 8 models from 4 families (7B–70B). Activation patching confirms causality: swapping hidden states at the confidence position shifts the reported confidence (ρ = 0.976 layer gradient), while random swaps do nothing. There is a limit at 70B: the softmax distribution carries valid metacognitive signal, but the argmax text remains stuck at "99% confident". The model has learned the internal routing, but the signal does not pass the text bottleneck. Seed-level replication across 3 models shows that discrimination is stable but the shape of the confidence distribution is seed-sensitive. The work was pre-registered across 2 studies (with noted deviations); code and a pre-print are available.
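The 70B observation can be illustrated with a toy example. This is a minimal sketch and not the paper's readout: the confidence levels, the logit values, and the use of a softmax-weighted mean as the "distribution" readout are all assumptions made for illustration. It shows how two answers can both decode to 99 under argmax while their softmax distributions differ.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def readouts(logits, levels):
    """Return (argmax confidence, softmax-weighted mean confidence)."""
    probs = softmax(np.asarray(logits, dtype=float))
    argmax_conf = float(levels[int(np.argmax(probs))])
    expected_conf = float(probs @ levels)
    return argmax_conf, expected_conf

# Hypothetical candidate confidence tokens (assumption, not from the paper).
levels = np.array([10, 30, 50, 70, 90, 99], dtype=float)

# Hypothetical logits: 99 is the top token in both cases, but the second
# vector puts far more probability mass on the lower levels.
logits_sure = np.array([-4.0, -4.0, -3.0, -2.0, 0.0, 3.0])
logits_unsure = np.array([0.5, 1.0, 1.5, 1.5, 1.5, 2.0])

argmax_sure, expected_sure = readouts(logits_sure, levels)
argmax_unsure, expected_unsure = readouts(logits_unsure, levels)

print("argmax:", argmax_sure, argmax_unsure)
print("softmax mean:", round(expected_sure, 1), round(expected_unsure, 1))
```

In this toy setup the decoded text is identical for both answers, so any difference in confidence is visible only to a reader with access to the token probabilities.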

Key Points
  • Hidden states already detect correctness with 0.76–0.88 AUROC, but verbal outputs default to 99% confidence.
  • Fine-tuning with probe targets takes under 10 minutes on an M3 Ultra using LoRA with only a few hundred examples.
  • Activation patching confirms a causal link: swapping states at the confidence position shifts the output (ρ = 0.976), while swaps at random positions have no effect.
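The first two points can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the hidden states are synthetic random vectors, the probe is a plain logistic regression, and the "N% confident" target format and all function names are hypothetical. The resulting strings stand in for the kind of targets a LoRA fine-tune could be trained on.

```python
import numpy as np

def auroc(scores, labels):
    """Chance that a random correct item outscores a random incorrect one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Fit a linear probe by gradient descent on the logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_confidence(X, w, b):
    """Probe output: estimated probability that the answer is correct."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def confidence_targets(probs):
    """Turn probe probabilities into text targets (hypothetical format)."""
    return [f"{int(round(100 * p))}% confident" for p in probs]

# Synthetic stand-in for hidden states: correct answers are shifted along
# one direction, incorrect answers the opposite way (assumption).
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n).astype(float)  # 1 = correct, 0 = incorrect
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + (2 * y[:, None] - 1) * direction

w, b = fit_logistic_probe(X[:300], y[:300])   # fit on 300 examples
probs = probe_confidence(X[300:], w, b)       # score 100 held-out examples
held_out_auroc = auroc(probs, y[300:])
targets = confidence_targets(probs)

print("held-out AUROC:", round(float(held_out_auroc), 3))
print("example targets:", targets[:3])
```

The AUROC printed here reflects only the synthetic data and says nothing about the 0.76–0.88 range reported for real models.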

Why It Matters

The method pushes AI models to state their uncertainty honestly, which could build trust in high-stakes applications such as medical diagnosis and legal analysis.