AI Safety

Latent Reasoning Sprint #2: Token-Based Signals and Linear Probes

New interpretability research shows that latent step 3 in the CODI Llama 3.2 1B model triggers 'therefore' conclusion signaling, revealing asymmetric reasoning patterns.

Deep Dive

Independent AI interpretability researcher Realmbird has published the second installment of their 'Latent Reasoning Sprint' series, providing new insights into the internal reasoning mechanisms of the CODI Llama 3.2 1B model. The research builds on previous findings about the model's 'compute/store alternation' pattern, in which even-numbered latent steps appear to store intermediate answers while odd-numbered steps perform computation. This latest analysis reveals that the tuned logit lens, a learned projection for decoding intermediate activations into vocabulary space, fails to generalize beyond CODI-specific data, limiting its utility for broader mechanistic interpretability. The study also confirms that final answer detection remains robust at even steps 2 and 4 across top-k values from 1 to 10, ruling out the possibility that earlier results were artifacts of a particular detection threshold.
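
To make the robustness check concrete, here is a minimal sketch of a logit-lens-style top-k sweep, assuming access to per-step hidden states and an unembedding matrix. All names (answer_in_topk, W_U, answer_id, latent_steps) and the random tensors are illustrative stand-ins; the actual CODI activations and Realmbird's code are not part of this write-up.

    # Minimal sketch: project each latent step's hidden state through the
    # unembedding matrix (logit-lens style) and test whether the answer
    # token appears in the top-k predictions for k = 1..10.
    # All tensors are random stand-ins for real CODI activations.
    import torch

    def answer_in_topk(hidden_state, W_U, answer_id, k):
        # (d_model,) @ (d_model, vocab) -> (vocab,) logits
        logits = hidden_state @ W_U
        return answer_id in logits.topk(k).indices

    torch.manual_seed(0)
    d_model, vocab = 64, 1000          # toy sizes; real dims are much larger
    W_U = torch.randn(d_model, vocab)  # stand-in unembedding matrix
    latent_steps = {i: torch.randn(d_model) for i in range(1, 7)}
    answer_id = 42                     # hypothetical answer token id

    for k in range(1, 11):             # the k = 1..10 sweep from the study
        hits = [s for s, h in latent_steps.items()
                if answer_in_topk(h, W_U, answer_id, k)]
        print(f"k={k:2d}: answer detected at latent steps {hits}")

A detected-step set that stays stable across the full k range, as reported for even steps 2 and 4, is what rules out a threshold artifact.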

One of the most intriguing discoveries is that the token 'therefore' emerges in the model's predictions only after latent step 3, with detection rates increasing through step 6. This suggests step 3 serves a distinct 'conclusion-signaling' role compared to step 5, despite both being odd computation steps. Realmbird trained a linear probe (a simple linear classifier) to distinguish intermediate from final answer representations within the model's activations. The probe activated most strongly at odd steps 3 and 5, consistent with the compute/store hypothesis, but step 3 peaked higher than step 5, revealing a previously unknown asymmetry in the reasoning process. This asymmetry may explain why patching the final two latent vectors with random noise does not degrade accuracy: the model appears to commit to an answer earlier in the chain.
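
As a rough illustration of the probe methodology, the sketch below trains a single linear layer to separate 'intermediate' from 'final answer' activations. The Gaussian clusters are synthetic stand-ins; the actual CODI hidden states, labels, and probe setup used by Realmbird are not detailed in this summary.

    # Minimal linear-probe sketch: a single linear layer trained with
    # logistic loss to separate "intermediate" from "final answer"
    # activations. Synthetic clusters stand in for real CODI hidden states.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, n_per_class = 256, 500

    # Hypothetical data: final-answer states shifted along a random direction.
    direction = torch.randn(d_model)
    X_inter = torch.randn(n_per_class, d_model)
    X_final = torch.randn(n_per_class, d_model) + 0.5 * direction
    X = torch.cat([X_inter, X_final])
    y = torch.cat([torch.zeros(n_per_class), torch.ones(n_per_class)])

    probe = nn.Linear(d_model, 1)      # the linear probe itself
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(200):               # full-batch training
        opt.zero_grad()
        loss = loss_fn(probe(X).squeeze(-1), y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        acc = ((probe(X).squeeze(-1) > 0) == y.bool()).float().mean()
    print(f"probe train accuracy: {acc:.3f}")

Applied per latent step, the mean probe score at each step yields the activation profile described above, with peaks at steps 3 and 5.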

Key Points
  • CODI Llama 3.2 1B's tuned logit lens fails to generalize to non-CODI activations, limiting interpretability tool utility.
  • The token 'therefore' emerges specifically after latent step 3, suggesting step 3 triggers conclusion signaling distinct from step 5's computation.
  • A linear probe for final answer detection activates most strongly at odd steps 3 and 5, confirming compute/store alternation, with step 3 peaking higher than step 5.

Why It Matters

This research advances mechanistic interpretability, helping developers understand and debug complex reasoning in language models like Llama 3.2.