AI Safety

Anthropic's NLAs reveal Qwen 2.5 7B's multiplication hints but need refinement

LessWrong AI May 17, 2026

⚡ Round-trip validation lets activations explain themselves; early multiplication results show promise.

Deep Dive

Anthropic's Neural Language Autoencoders (NLAs) offer a novel way to peek inside language models: they translate residual stream activations into natural language descriptions and then back into activations. This round-trip check is meant to confirm that a description preserves the information in the original activation. The approach trains two models (encoder and decoder) on a target LLM's activations. In a recent test, an enthusiast applied NLAs to Qwen 2.5 7B (Layer 20) to understand its multiplication algorithm.
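To make the round-trip idea concrete, here is a minimal sketch of how one might score a reconstruction against the original activation. The metric choice (cosine similarity plus normalized squared error) and the function names are assumptions for illustration; the article does not say which measure the NLA authors or the experimenter actually use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # direction is undefined for a zero vector
    return dot / (norm_u * norm_v)


def normalized_mse(original, reconstructed):
    """Squared reconstruction error divided by the squared norm of the original."""
    err = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    scale = sum(x * x for x in original)
    return err / scale if scale else float("inf")


def round_trip_report(original, reconstructed):
    """Hypothetical helper: compare an activation with its round-trip reconstruction.

    `original` is the activation read from the target model; `reconstructed`
    is what comes back after activation -> text -> activation.
    """
    if len(original) != len(reconstructed):
        raise ValueError("activation vectors must have the same length")
    return {
        "cosine": cosine_similarity(original, reconstructed),
        "nmse": normalized_mse(original, reconstructed),
    }
```

A cosine near 1 and a normalized error near 0 would indicate that the text description carried most of the information in the activation; low scores would mean the description lost or distorted it.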

The experiment revealed that Qwen reliably outputs each digit of a product as a single token, simplifying pattern recognition. However, the verbalized activations were mixed: for '7 × 24' yielding '168', descriptions like 'The product of 9 and 13 is 1' appeared, suggesting the model partially ties digits to multiplication facts but lacks clean structure. Other verbalizations were garbled or hallucinated unrelated contexts (e.g., 'to find the product of 8 and 2...'). The author concludes that NLAs have promise for interpretability but require more training data and layer-specific tuning to yield coherent explanations.
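One way to triage verbalizations like these is to check the arithmetic they state. The sketch below is an illustration, not the author's method: the regular expression, the function names, and the three labels are assumptions. It treats a stated result as a "prefix" when it matches the leading digits of the true product, which is what a partially generated answer would look like if each digit is its own token.

```python
import re

# Hypothetical pattern for claims such as "The product of 9 and 13 is 1".
CLAIM = re.compile(r"product of (\d+) and (\d+) is (\d+)", re.IGNORECASE)


def digit_tokens(a, b):
    """Digits of a * b, one per entry (assumes one token per digit)."""
    return list(str(a * b))


def classify_claim(text):
    """Return 'exact', 'prefix', 'wrong', or None if no complete claim is found."""
    match = CLAIM.search(text)
    if match is None:
        return None
    a, b, stated = match.groups()
    true_product = str(int(a) * int(b))
    if stated == true_product:
        return "exact"
    # A partially emitted answer shows up as the leading digits of the product.
    if true_product.startswith(stated):
        return "prefix"
    return "wrong"
```

Under these assumptions, 'The product of 9 and 13 is 1' is labeled "prefix" because 9 × 13 = 117, while the truncated fragment 'to find the product of 8 and 2...' yields no label because it never states a result.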

Key Points

NLAs use round-trip validation to check that verbalized activations faithfully represent model internals.
Qwen 2.5 7B generates each digit of a multiplication result as a separate token, aiding algorithm extraction.
Early verbalizations show fragments of multiplication logic (e.g., 'product of 12 and 14 is 168') but are often inconsistent or hallucinated.

Why It Matters

This method could unlock faithful AI interpretability, but current results show it needs major refinement for practical use.
