Anthropic's NLAs reveal Qwen 2.5 7B's multiplication hints but need refinement
Round-trip validation lets activations explain themselves – early multiplication results show promise.
Anthropic's Neural Language Autoencoders (NLAs) offer a novel way to peek inside language models: they translate residual stream activations into natural language descriptions and then back into activations. This round-trip validation is meant to check fidelity, since a description that cannot be turned back into something close to the original activation has lost information. The approach trains two models, an encoder and a decoder, on a target LLM's activations. In a recent test, an enthusiast applied NLAs to Qwen 2.5 7B (Layer 20) to understand its multiplication algorithm.
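To make the round-trip idea concrete, here is a minimal sketch of how a reconstruction could be scored. It is not the NLA implementation: `toy_verbalize` and `toy_reconstruct` are hypothetical stand-ins for the two trained models, and cosine similarity and normalized squared error are assumed as fidelity metrics, since the write-up does not say which metric was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalized_mse(original, reconstructed):
    """Squared error divided by the squared norm of the original (0 = perfect)."""
    err = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    scale = sum(a * a for a in original)
    return err / scale

def round_trip_score(activation, verbalize, reconstruct):
    """Activation -> text -> activation, then compare the two activations."""
    text = verbalize(activation)
    rebuilt = reconstruct(text)
    return text, cosine_similarity(activation, rebuilt), normalized_mse(activation, rebuilt)

# Hypothetical stand-ins: a real NLA would use trained language models here.
def toy_verbalize(activation):
    # "Describe" the vector by printing its values rounded to one decimal place.
    return " ".join(f"{x:.1f}" for x in activation)

def toy_reconstruct(text):
    # Parse the rounded values back into a vector.
    return [float(tok) for tok in text.split()]

activation = [0.52, -1.31, 2.04, 0.07]  # made-up 4-dimensional "activation"
text, cos, nmse = round_trip_score(activation, toy_verbalize, toy_reconstruct)
print(text, cos, nmse)
```

In this toy, rounding is the only source of loss, so the scores come out near perfect. With real models, a fluent description that reconstructs poorly would be flagged by the same comparison.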
The experiment revealed that Qwen reliably outputs each digit of a product as a single token, simplifying pattern recognition. However, the activation verbalizations were mixed: for '7 × 24', which yields '168', descriptions like 'The product of 9 and 13 is 1' appeared, suggesting the model partially ties digits to multiplication facts but lacks clean structure. Other verbalizations were garbled or hallucinated unrelated contexts (e.g., 'to find the product of 8 and 2...'). The author concludes that NLAs show promise for interpretability but require more training data and layer-specific tuning to yield coherent explanations.
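One way to put numbers on "mixed" is to check each verbalization against the prompt's arithmetic. The sketch below is an illustration, not the author's evaluation: the regex, the function name `check_verbalization`, and the prefix check (motivated by the one-digit-per-token output) are all assumptions.

```python
import re

# Matches claims of the form "product of A and B is C" (assumed phrasing).
CLAIM = re.compile(r"product of (\d+) and (\d+) is (\d+)", re.IGNORECASE)

def check_verbalization(text, a, b):
    """Compare a verbalized claim against the true product of the prompt a x b."""
    target = a * b
    match = CLAIM.search(text)
    if match is None:
        # Garbled or truncated descriptions contain no complete claim.
        return {"parsed": False}
    x, y, claimed = (int(g) for g in match.groups())
    return {
        "parsed": True,
        "operands_match_prompt": {x, y} == {a, b},
        "claim_is_true": claimed == x * y,
        "claim_matches_target": claimed == target,
        # Digits are emitted one token at a time, so a claimed value may only
        # cover the leading digit(s) of the answer.
        "claim_is_prefix_of_target": str(target).startswith(str(claimed)),
    }

# Verbalization quoted for the prompt 7 x 24 = 168.
print(check_verbalization("The product of 9 and 13 is 1", 7, 24))
# A truncated description with no complete claim.
print(check_verbalization("to find the product of 8 and 2...", 7, 24))
```

For the first quoted description, the operands do not match the prompt and the claim is false as arithmetic (9 × 13 = 117), yet the claimed '1' is the leading digit of 168. That matches the article's reading of partial digit information without clean structure, although a single example cannot confirm it.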
- NLAs use round-trip validation to check that verbalized activations faithfully represent model internals.
- Qwen 2.5 7B generates each digit of a multiplication result as a separate token, aiding algorithm extraction.
- Early verbalizations show fragments of multiplication logic (e.g., 'product of 12 and 14 is 168') but are often inconsistent or hallucinated.
Why It Matters
This method could unlock faithful AI interpretability, but current results show it needs major refinement for practical use.