Probing CODI's Latent Reasoning Chain with Logit Lens and Tuned Lens
New interpretability research finds that a tuned lens can read final answers out of CODI's 6-step latent reasoning chain, with odd steps showing active computation and even steps acting as answer storage.
New interpretability research reveals how CODI's latent reasoning models internally process multi-step problems, with significant implications for AI safety and monitoring. Researcher Realmbird applied logit lens and tuned lens techniques to probe CODI's 6-step reasoning chain in the publicly available Llama 3.2 1B checkpoint, testing on GSM8K arithmetic problems. The study found that tuned lenses trained directly on CODI's outputs could decode final answers from the odd-numbered latent steps (1, 3, 5), information that plain logit lenses failed to reveal. This suggests the odd steps carry crystallized answers, while the even steps (2, 4) serve primarily as storage, showing the highest final-answer detection rates.
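To make the two probes concrete, here is a minimal sketch of the decoding step, assuming each latent step is a single hidden-state vector: the plain logit lens projects the raw state through the unembedding matrix, while the tuned lens first maps it through a learned per-step affine translator. The tensors `latent_steps` and `W_U` are random placeholders standing in for CODI's latent chain and the Llama 3.2 1B unembedding so the snippet runs standalone; it is not the study's code.

```python
# Minimal sketch of logit lens vs. tuned lens on a latent reasoning chain.
# Assumptions: `latent_steps` stands in for CODI's 6 latent hidden states
# (one d_model vector per step) and `W_U` for the model's unembedding matrix;
# both are random placeholders here.
import torch
import torch.nn as nn

d_model, vocab_size, n_steps = 2048, 128_256, 6   # Llama-3.2-1B-like sizes
latent_steps = torch.randn(n_steps, d_model)      # placeholder latent chain
W_U = torch.randn(vocab_size, d_model)            # placeholder unembedding

# Plain logit lens: project the raw hidden state straight through the unembedding.
logit_lens_logits = latent_steps @ W_U.T          # (n_steps, vocab)

# Tuned lens: learn a per-step affine "translator" that maps the hidden state
# into the space the unembedding expects before projecting.
translators = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_steps)])
tuned_lens_logits = torch.stack(
    [translators[i](latent_steps[i]) @ W_U.T for i in range(n_steps)]
)

# Compare the top token each probe decodes at every latent step.
for step in range(n_steps):
    plain_top = logit_lens_logits[step].argmax().item()
    tuned_top = tuned_lens_logits[step].argmax().item()
    print(f"step {step + 1}: logit-lens top id {plain_top}, tuned-lens top id {tuned_top}")
```

In the actual study the translators would be trained (e.g., against the model's final-layer predictions) before being used to decode; the sketch only shows where they sit in the pipeline.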
The research also uncovered unexpected patterns in CODI's internal computation: entropy peaks at steps 3 and 5, indicating active computation at those stages. Most strikingly, translators trained directly on CODI's latent hidden states underperformed those trained on text tokens, suggesting the latent vectors remain close to text token geometry despite their specialized function. The tuned lens specialized for CODI latents tended to overfit, mirroring the final layer's behavior, while separate analyses of odd and even latents showed distinct decoding patterns. These findings advance our ability to monitor AI reasoning in real time, potentially enabling early-exit strategies once an answer has crystallized, or flagging when internal computation diverges from stated reasoning.
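The entropy finding can be made operational with a short sketch: softmax each latent step's probe logits into a vocabulary distribution and compute its Shannon entropy, so an "entropy peak" is simply a step whose distribution is still spread over many candidate tokens rather than committed to one answer. The logits below are random placeholders, not CODI outputs.

```python
# Sketch of the per-step entropy measurement behind "entropy peaks at steps 3 and 5":
# softmax the probe logits at each latent step and compute Shannon entropy.
# High entropy = probability mass spread over many tokens (computation in flux);
# low entropy = a committed, crystallized answer. `step_logits` is a placeholder.
import torch
import torch.nn.functional as F

n_steps, vocab_size = 6, 128_256
step_logits = torch.randn(n_steps, vocab_size)    # placeholder probe logits

probs = F.softmax(step_logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # nats per step

for step, h in enumerate(entropy, start=1):
    print(f"latent step {step}: entropy {h.item():.2f} nats")
```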
- A tuned lens reveals final answers in CODI's odd latent steps (1, 3, 5) that the plain logit lens cannot decode
- Even latent steps (2, 4) show the highest final-answer detection rates (consistent with a storage function), while odd steps show entropy peaks (indicating active computation)
- Translators trained directly on CODI's latent states underperformed text-trained translators by 15-20%, suggesting latent vectors remain close to text token geometry (one way to quantify this is sketched after this list)
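One hypothetical way to quantify "latent vectors remain close to text token geometry" is to measure the cosine similarity between each latent step and its nearest row of the token embedding table; this is an illustrative metric, not necessarily the one used in the study, and all tensors below are random placeholders.

```python
# Sketch of a nearest-text-token check for latent geometry (illustrative metric,
# not the study's exact measurement). All tensors are random placeholders.
import torch
import torch.nn.functional as F

d_model, vocab_size, n_steps = 2048, 128_256, 6
latent_steps = torch.randn(n_steps, d_model)          # placeholder CODI latents
token_embeddings = torch.randn(vocab_size, d_model)   # placeholder embedding table

lat = F.normalize(latent_steps, dim=-1)
emb = F.normalize(token_embeddings, dim=-1)
similarity = lat @ emb.T                              # (n_steps, vocab) cosine scores
nearest_sim, nearest_id = similarity.max(dim=-1)

# If latents truly hug text geometry, real checkpoints should show high cosines
# to specific tokens; random placeholders will not.
for step in range(n_steps):
    print(f"latent step {step + 1}: nearest token id {nearest_id[step].item()}, "
          f"cosine {nearest_sim[step].item():.3f}")
```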
Why It Matters
Enables lightweight monitoring tools to detect when AI reasoning diverges from stated logic, crucial for safety-critical applications.