Researchers find an LLM commits to its answer 17–31 tokens before stating it
In Qwen3-4B-Instruct, the answer preference stabilizes well before the answer appears in the visible output.
A new paper on arXiv (2605.06723) tackles a fundamental question: when does a language model actually “decide” its answer? The authors propose a formal framework called finite-answer preference stabilization. For a given model state ξ and a pair of answer verbalizers (e.g., “yes”/“no”), they compute δ(ξ) = S_θ(yes|ξ) − S_θ(no|ξ), an exact log-odds code that tracks the model’s internal preference before the answer itself is written out. Because δ is computed directly from the model, it needs neither greedy rollouts nor learned probes, which gives a clean theoretical object for studying answer commitment.
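As a rough illustration of the quantity described above, the sketch below computes δ for one state from per-token log-probabilities. It assumes S_θ is the summed token log-probability of each verbalizer given the state, which this summary does not confirm; the function names `sequence_score`, `delta`, and `preference_probability` are hypothetical and not taken from the paper.

```python
import math

def sequence_score(token_logprobs):
    """Assumed form of S_theta: sum of a verbalizer's per-token log-probabilities."""
    return sum(token_logprobs)

def delta(yes_token_logprobs, no_token_logprobs):
    """Log-odds preference: S_theta(yes | xi) - S_theta(no | xi)."""
    return sequence_score(yes_token_logprobs) - sequence_score(no_token_logprobs)

def preference_probability(d):
    """Probability of 'yes' restricted to the two answers (logistic of delta).

    Valid only under the assumption that the scores are log-probabilities.
    """
    return 1.0 / (1.0 + math.exp(-d))

# Illustrative numbers only: the model puts 0.6 on "yes" and 0.3 on "no".
d = delta([math.log(0.6)], [math.log(0.3)])
p_yes = preference_probability(d)
```

A positive δ means the state currently favors “yes”, a negative δ favors “no”, and the magnitude says how strongly. With the illustrative inputs above, δ equals log 2 and the two-answer probability of “yes” is 2/3.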
Testing on Qwen3-4B-Instruct in controlled delayed-verdict tasks, the authors show that δ stabilizes 17–31 tokens before the answer becomes parseable in the main templates. The signal tracks the model’s eventual output (not the truth) and is linearly recoverable from compact hidden-state summaries. Importantly, the effect is partially separable from cursor (token position) progress, and it transfers as shared information without requiring a single invariant coordinate. Diagnostics confirm that the measurement is distinct from online stopping, verbalizer-free belief, and causal answer control. Local steering shows that δ is sensitive to intervention, but it does not yield reliable control over generation, which hints at deeper challenges for interpretability.
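To make the "tokens before parseable" figure concrete, here is a minimal sketch of one way such a lead could be measured from a trace of δ values. The stabilization rule used here (δ keeps the sign of its final value, above an optional margin, from some position onward) is an assumption for illustration, not the paper's stated criterion, and `stabilization_index`, `lead_tokens`, and `margin` are hypothetical names.

```python
def stabilization_index(deltas, margin=0.0):
    """Earliest position from which delta keeps its final sign with |delta| > margin.

    Returns len(deltas) if even the last value does not clear the margin.
    """
    if not deltas:
        raise ValueError("deltas must be non-empty")
    final_sign = 1 if deltas[-1] > 0 else -1
    idx = len(deltas)
    # Walk backwards while the preference still agrees with the final one.
    for t in range(len(deltas) - 1, -1, -1):
        if deltas[t] * final_sign > margin:
            idx = t
        else:
            break
    return idx

def lead_tokens(deltas, parse_index, margin=0.0):
    """Tokens between stabilization and the position where the answer is parseable."""
    window = deltas[: parse_index + 1]  # ignore anything after the answer appears
    return parse_index - stabilization_index(window, margin)

# Illustrative trace: the preference flips once, then settles on "yes".
trace = [0.2, -0.1, 0.3, 0.8, 1.5, 2.0]
lead = lead_tokens(trace, parse_index=5)
```

In this toy trace the sign last changes at position 1, so the preference counts as stable from position 2, three tokens ahead of the parseable answer at position 5. Raising `margin` demands a stronger preference before calling it stable and therefore shortens the measured lead.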
- The paper introduces the δ(ξ) log-odds code to measure when LLMs commit to an answer, without rollouts or learned probes.
- On Qwen3-4B-Instruct, answer preference stabilizes 17–31 tokens before the answer is parseable.
- Signal tracks eventual output (not truth), is linearly recoverable from hidden states, and is separable from token cursor progress.
Why It Matters
The framework provides a principled measure of internal decision timing in LLMs, a step toward more interpretable and trustworthy models.