AI Safety

Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens

New research shows KV cache steering can boost accuracy, while embedding steering fails to beat random noise.

Deep Dive

Researcher Realmbird has published the third installment in a series investigating latent reasoning in language models, focusing on activation difference steering and the logit lens. The experiments were conducted on the publicly available CODI Llama 3.2 1B checkpoint, a model trained to carry out chain-of-thought reasoning implicitly in its continuous latent states. A key finding is that a tuned logit lens (a probe that reads intermediate-layer activations through the model's unembedding) often fails to locate a prompt's final answer at any consistent layer, instead surfacing close approximations. For example, when asked how many of 600 employees did not get a promotion or bonus, the lens produced numbers such as 720 or 350 rather than the correct answer, 360.
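
To make the probe concrete, here is a minimal logit-lens readout in the standard Hugging Face interface. This is an illustrative sketch, not the post's code: the model name is a stand-in for the CODI checkpoint, the prompt paraphrases the employee question rather than quoting it, and the learned per-layer translator that makes the lens "tuned" is omitted (a plain logit lens is shown).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model name; the post uses the public CODI Llama 3.2 1B checkpoint,
# which shares the Llama architecture assumed below.
MODEL_NAME = "meta-llama/Llama-3.2-1B"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Illustrative GSM8K-style prompt; not the exact wording from the post.
prompt = ("A company has 600 employees. 40% received a promotion or bonus. "
          "How many did not? Answer:")
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings()   # lm_head
final_norm = model.model.norm             # Llama's final RMSNorm

# hidden_states[0] is the embedding layer; hidden_states[i] is the output of block i.
for layer_idx, h in enumerate(out.hidden_states):
    last = h[0, -1]  # hidden state at the final prompt position
    # Plain logit-lens readout: final norm + unembedding applied to an intermediate
    # state. A *tuned* lens would first pass `last` through a learned per-layer
    # affine translator; that trained component is not reproduced here.
    logits = unembed(final_norm(last))
    print(f"layer {layer_idx:2d} -> {tok.decode([int(logits.argmax())])!r}")
```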

The study tested two steering methods: embedding steering and KV cache steering. Adding the average difference between hidden states produced under different latent vectors to the model's embeddings did not improve accuracy and performed about as well as patching in random noise, suggesting the averaged difference vectors may be too noisy to carry usable signal. In contrast, directly steering the model's key-value (KV) cache (the stored attention keys and values for previous tokens) during answer generation showed a measurable increase in accuracy. This result offers a practical handle for influencing a model's latent reasoning process and provides new evidence for the 'compute/store alternation' hypothesis of how models perform internal calculations.
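
The KV cache intervention can be pictured with a short sketch. Everything below is an assumption for illustration: the model name, layer index, steering strength, greedy decoding loop, and the steering directions `steer_k` / `steer_v` all stand in for whatever the post's experiments actually used, and the legacy tuple cache format is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B"   # stand-in for the CODI checkpoint
LAYER = 8                                # layer whose cache is steered (assumption)
ALPHA = 4.0                              # steering strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def steered_generate(prompt, steer_k, steer_v, max_new_tokens=16):
    """Build the KV cache from the prompt, add steering directions to the cached
    keys/values at one layer, then greedily decode the answer from that cache.
    steer_k / steer_v: tensors broadcastable to (batch, num_kv_heads, seq_len, head_dim)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, use_cache=True)

    # Legacy cache format: a (key, value) pair per layer. Newer transformers
    # versions return a Cache object, hence the conversion below.
    past = out.past_key_values
    if hasattr(past, "to_legacy_cache"):
        past = past.to_legacy_cache()
    past = list(past)
    k, v = past[LAYER]
    past[LAYER] = (k + ALPHA * steer_k, v + ALPHA * steer_v)  # broadcast over positions
    past = tuple(past)

    generated = inputs["input_ids"]
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    for _ in range(max_new_tokens):
        generated = torch.cat([generated, next_token], dim=-1)
        with torch.no_grad():
            step = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = step.past_key_values
        next_token = step.logits[:, -1].argmax(dim=-1, keepdim=True)
    return tok.decode(generated[0], skip_special_tokens=True)
```

The reported accuracy gain depends on the post's own steering directions and hyperparameters; the sketch only shows where such an intervention plugs into generation.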

Key Points
  • Tuned logit lens on CODI Llama 3.2 1B surfaces approximations of the answer (e.g., 720) but rarely the exact final answer, and not at a consistent layer.
  • Steering the model's KV cache during answer generation successfully increased output accuracy in experiments.
  • Embedding steering using average hidden state differences failed to improve performance, matching results from random vector patching (see the sketch below).
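
For reference, the failed baseline can be sketched as follows. The function names and tensor shapes are hypothetical; the point is only to show what "average hidden state difference" and the random-vector control mean in code.

```python
import torch

def mean_difference_direction(hidden_a: torch.Tensor, hidden_b: torch.Tensor) -> torch.Tensor:
    """Average activation difference between two conditions, scaled to unit norm.
    hidden_a, hidden_b: (num_examples, hidden_dim) hidden states collected under
    two different latent reasoning vectors (hypothetical shapes)."""
    diff = (hidden_a - hidden_b).mean(dim=0)
    return diff / diff.norm()

def random_direction(hidden_dim: int) -> torch.Tensor:
    """Random unit vector of the same size: the control condition compared against."""
    noise = torch.randn(hidden_dim)
    return noise / noise.norm()

# Either direction would then be added (scaled) to the embedding or hidden state
# being steered; the post reports both move accuracy by about the same amount.
```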

Why It Matters

Provides concrete methods for interpreting and steering internal model reasoning, a key step toward more controllable and transparent AI systems.