AI Safety

NLA Thought Anchors: Why AI's correct answers are better decoded from specific token positions

First sentence is key: NLA reconstruction loss is about 30% higher for incorrect model responses

Deep Dive

Realmbird's investigation into Natural Language Autoencoders (NLAs) examines how the correctness of a model's response affects what an NLA can decode from its activations. Using Qwen2.5-7B-Instruct as the base model and pretrained NLAs from kitft's repository, the researcher found that the extraction position largely determines whether the NLA's output contains the model's final answer. Specifically, the probability that the final answer appears in the activation verbalizer (AV) output increases as the extraction position approaches the model's final answer token. For activations taken from correct responses ("correct activations"), the answer appears at significantly higher rates than for activations taken from incorrect responses ("incorrect activations"). The first sentence of the NLA explanation is the most counterfactually important, both for reconstruction loss and for whether the explanation contains the final answer.
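
The position effect described above can be measured by bucketing NLA explanations by how far into the response the activation was extracted, then counting how often the final answer appears in each bucket. The following is a minimal sketch, not the investigation's actual code: the record fields (`rel_pos`, `correct`, `explanation`, `answer`), the helper names, and the toy records are all hypothetical, and the digit-boundary regex is only one plausible way to test for a numeric GSM8K-style answer.

```python
import re
from collections import defaultdict


def contains_answer(explanation: str, answer: str) -> bool:
    """True if `answer` appears in the text and is not part of a longer number."""
    pattern = r"(?<!\d)" + re.escape(answer) + r"(?!\d)"
    return re.search(pattern, explanation) is not None


def answer_rate_by_position(records, n_buckets=4):
    """Fraction of explanations containing the answer, per (correct, bucket).

    `rel_pos` is assumed to be a relative extraction position in [0, 1],
    where 1 is the model's final answer token (hypothetical convention).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        bucket = min(int(r["rel_pos"] * n_buckets), n_buckets - 1)
        key = (r["correct"], bucket)
        totals[key] += 1
        hits[key] += contains_answer(r["explanation"], r["answer"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy records for illustration only; these are not data from the study.
records = [
    {"rel_pos": 0.1, "correct": True, "answer": "42",
     "explanation": "The model is setting up a sum."},
    {"rel_pos": 0.9, "correct": True, "answer": "42",
     "explanation": "The total is 42 apples."},
    {"rel_pos": 0.9, "correct": False, "answer": "42",
     "explanation": "Something about 142 items."},
]
rates = answer_rate_by_position(records, n_buckets=2)
for (correct, bucket), rate in sorted(rates.items()):
    print(f"correct={correct} bucket={bucket} rate={rate:.2f}")
```

Keeping correct and incorrect responses in separate keys makes it possible to compare the two curves across position buckets directly, which is the comparison the paragraph above reports.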

Further analysis shows that incorrect activations can lead to degenerate NLA outputs (repetition, garbled tokens, emoji blocks), which never appeared for correct activations. The length of the NLA response also varies more for incorrect activations, which the researcher reads as a sign of model uncertainty. On the GSM8K dataset, reconstruction loss is approximately 30% higher for incorrect activations than for correct ones. Counterfactual importance is spread more evenly across sentences for incorrect activations, whereas for correct activations it is concentrated in the first sentence. These findings support Ryan Greenblatt's earlier result that NLA output contains what the AI will predict at a rate much higher than chance for both correct and incorrect problems, while adding differences in output quality and position sensitivity between the two cases.
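
Two of the quantities above lend themselves to a short worked example: the relative gap in reconstruction loss, and a per-sentence importance score. The sketch below uses leave-one-out ablation (drop one sentence, re-measure the loss) as a simple stand-in for counterfactual importance; the original investigation may use a different procedure. The `loss_fn` argument stands for whatever maps an explanation back to a reconstruction loss, and `toy_loss`, like the example sentences and loss values, is invented purely so the code runs.

```python
from typing import Callable, List


def sentence_importance(sentences: List[str],
                        loss_fn: Callable[[str], float]) -> List[float]:
    """Leave-one-out importance: loss increase when each sentence is removed."""
    base = loss_fn(" ".join(sentences))
    scores = []
    for i in range(len(sentences)):
        ablated = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append(loss_fn(ablated) - base)
    return scores


def relative_gap(loss_incorrect: float, loss_correct: float) -> float:
    """How much higher the incorrect-response loss is, relative to correct."""
    return (loss_incorrect - loss_correct) / loss_correct


def toy_loss(text: str) -> float:
    """Hypothetical stand-in for a real NLA reconstruction loss."""
    return 1.0 - 0.5 * ("answer is 18" in text) - 0.25 * ("eggs" in text)


sentences = ["The answer is 18.", "It involves eggs."]
importance = sentence_importance(sentences, toy_loss)
print("importance per sentence:", importance)

# Illustrative loss values only: a loss of 1.3 vs 1.0 is a 30% gap when
# measured relative to the correct-response loss.
gap = relative_gap(1.3, 1.0)
print(f"incorrect is {gap:.0%} higher than correct")
# The same pair measured the other way: correct is about 23% lower.
print(f"correct is {1 - 1.0 / 1.3:.0%} lower than incorrect")
```

In the toy setup the first sentence receives the larger score because removing it raises the loss the most. The last two lines show why the direction of the comparison matters: "30% higher for incorrect" and "30% lower for correct" are not the same statement.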

Key Points
  • The rate at which the final answer appears in the NLA output rises as the extraction position approaches the model's final answer token, with correct activations showing significantly higher rates.
  • Degenerate outputs (repetition, garbled tokens, emoji blocks) only occur for activations from incorrect model responses, never for correct ones.
  • Counterfactual importance is concentrated in the first sentence for correct activations but spread more evenly for incorrect activations; reconstruction loss is approximately 30% higher for incorrect activations.

Why It Matters

The results indicate where to extract activations when decoding a model's reasoning with NLAs, and suggest that reconstruction loss and output quality may help distinguish correct from flawed internal states.