Researchers find LLMs commit to answers 17–31 tokens before stating them

arXiv cs.AI May 11, 2026

⚡ In Qwen3-4B-Instruct, answer preference stabilizes well before the answer becomes parseable in the visible output.

Deep Dive

A new paper on arXiv (2605.06723) tackles a fundamental question: when does a language model actually “decide” its answer? The authors propose a formal framework called finite-answer preference stabilization. For a given model state ξ and a pair of answer verbalizers (e.g., “yes”/“no”), they compute δ(ξ) = S_θ(yes|ξ) − S_θ(no|ξ), an exact log-odds code that tracks the model’s internal preference before the answer itself is emitted. Computing δ requires neither greedy rollouts nor learned probes, which makes it a clean theoretical object for studying answer commitment.
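To make the definition of δ concrete, here is a minimal Python sketch. It assumes that S_θ(a|ξ) is the next-token log-probability of a single-token verbalizer at state ξ; the paper's actual scoring function may differ (for example, for multi-token verbalizers). The function names and the toy vocabulary are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def delta(logits, yes_id, no_id):
    """Log-odds of the 'yes' verbalizer over the 'no' verbalizer.

    Assumption: S_theta(a | xi) is the log-probability of a
    single-token verbalizer a under the next-token distribution
    at state xi. This is an illustration, not the paper's code.
    """
    logp = log_softmax(logits)
    return logp[yes_id] - logp[no_id]

# Hypothetical toy vocabulary: index 0 = "yes", 1 = "no", 2 = "maybe".
toy_logits = [2.0, 0.5, 1.0]
d = delta(toy_logits, yes_id=0, no_id=1)
print(round(d, 3))  # 1.5: the softmax normalizer cancels in the difference
```

Under this assumption the normalizer cancels, so δ reduces to the difference of the two raw logits. A positive value means the state currently favors “yes”, a negative value favors “no”.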

Testing on Qwen3-4B-Instruct in controlled delayed-verdict tasks, the authors show that δ stabilizes 17–31 tokens before the answer becomes parseable in the main templates. The signal tracks the model’s eventual output rather than the ground truth, and it is linearly recoverable from compact hidden-state summaries. Importantly, the effect is partially separable from cursor progress (token position), and it transfers as shared information without requiring a single invariant coordinate. Diagnostics confirm that the measurement is distinct from online stopping, verbalizer-free belief, and causal answer control. Local steering shows that δ is sensitive to intervention, but it does not yield reliable generation control, which hints at deeper challenges for interpretability.
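One way to picture a “lead” such as 17–31 tokens is to take a per-token trajectory of δ and ask how early its sign settles on the final preference. The sketch below is an illustration under assumptions: the stabilization rule (same sign as the final value, magnitude above a margin, held through to the end) and the names `stabilization_index`, `lead_tokens`, and `margin` are hypothetical and are not the paper's definition. The trajectory values are made up.

```python
def stabilization_index(deltas, margin=0.0):
    """Earliest step from which delta keeps the final sign until the end.

    Assumed rule (not from the paper): a step counts as settled when
    delta has the same sign as the last value and |delta| > margin.
    Returns None if even the last value does not clear the margin.
    """
    if not deltas:
        raise ValueError("empty trajectory")
    final_sign = 1 if deltas[-1] > 0 else -1
    index = None
    # Walk backwards and stop at the first step that breaks the rule.
    for i in range(len(deltas) - 1, -1, -1):
        if deltas[i] * final_sign > margin:
            index = i
        else:
            break
    return index

def lead_tokens(deltas, parse_index, margin=0.0):
    """Tokens between stabilization and the step where the answer parses."""
    index = stabilization_index(deltas[: parse_index + 1], margin)
    return None if index is None else parse_index - index

# Made-up trajectory; the answer becomes parseable at step 6.
trajectory = [0.2, -0.1, 0.4, 0.9, 1.3, 1.1, 1.6]
print(lead_tokens(trajectory, parse_index=6))              # 4
print(lead_tokens(trajectory, parse_index=6, margin=0.5))  # 3
```

A stricter margin shortens the measured lead, so any reported lead depends on the stabilization criterion chosen.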

Key Points

Introduced δ(ξ) log-odds code to measure when LLMs commit to an answer without needing rollout or probes.
On Qwen3-4B-Instruct, answer preference stabilizes 17–31 tokens before the answer is parseable.
Signal tracks eventual output (not truth), is linearly recoverable from hidden states, and is partially separable from token cursor progress.

Why It Matters

Provides a principled measure of internal decision timing in LLMs, advancing interpretability and trustworthiness.
