You can now read Gemma 3's mind
Click any token in Gemma 3's output to see what it was thinking in plain English.
Anthropic, in collaboration with Neuronpedia, has released Natural Language Autoencoders (NLA) for Gemma 3 27B Instruct, a notable advance in AI interpretability. The system pairs two specialized LLMs: an Auto Verbalizer (AV) that translates the model's internal activations at any given token into human-readable text, and an Activation Reconstructor (AR) that checks that translation's faithfulness by attempting to rebuild the original activations from the generated text. Together, the two models are designed to produce explanations that are both readable and faithful to what the network actually computed.
The weights are available on Hugging Face, and Neuronpedia hosts an interactive demo: enter a prompt, click any token in the response, and read an explanation of what the model was "thinking" when it generated that token. For example, when prompted with "I am Elon Musk", the verbalized activations for the first response tokens described the content as fabricated and satirical, showing that the model internally flagged the persona as false from the outset.
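To make the two-model loop concrete, here is a minimal sketch of the round trip, assuming a Hugging Face transformers workflow. The hidden-state capture uses the standard transformers API, but the `auto_verbalizer` and `activation_reconstructor` stubs, the layer index, and their calling conventions are hypothetical placeholders; the release's actual interface may differ.

```python
# Minimal sketch of the AV/AR round trip. Hidden-state capture below is
# standard transformers usage; everything marked "hypothetical" is a
# placeholder to be filled in from the released weights.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/gemma-3-27b-it"  # the base model being interpreted

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
# Depending on your transformers version, the multimodal Gemma 3 class
# may be required instead of AutoModelForCausalLM.
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)

def token_activation(prompt: str, token_index: int, layer: int) -> torch.Tensor:
    """Grab the residual-stream activation for one token at one layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states[layer] has shape (batch, seq_len, hidden_dim)
    return out.hidden_states[layer][0, token_index].float()

def auto_verbalizer(activation: torch.Tensor) -> str:
    # Hypothetical stub: the released AV is an LLM that decodes English
    # text conditioned on an activation vector. Load the actual AV
    # weights from Hugging Face to fill this in.
    raise NotImplementedError

def activation_reconstructor(explanation: str) -> torch.Tensor:
    # Hypothetical stub: the released AR maps explanation text back to
    # a predicted activation vector.
    raise NotImplementedError

act = token_activation("I am Elon Musk", token_index=-1, layer=30)  # layer chosen arbitrarily
explanation = auto_verbalizer(act)                       # activations -> plain English
reconstruction = activation_reconstructor(explanation)   # English -> activations

# Faithfulness check: keep the explanation only if the AR can
# approximately rebuild the activation the AV explained.
score = F.cosine_similarity(act, reconstruction, dim=0)
print(explanation, float(score))
```

The autoencoder framing is what separates this from free-form "explain this neuron" prompting: because the AR must regenerate the activations from the AV's text alone, an explanation that sounds plausible but omits what the model actually encoded should reconstruct poorly.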
This tool marks a significant step toward transparency in black-box language models. By opening a window onto the token-level reasoning process, it lets developers and researchers debug unexpected outputs, identify hidden biases, and verify alignment without relying on the model's own post-hoc rationalizations. The NLA technique is model-agnostic and could be extended to architectures beyond Gemma 3, paving the way for more trustworthy AI systems. While currently limited to explanations of individual tokens, the ability to read an LLM's "mind" in real time is a major advance for safety research and model auditing.
- Auto Verbalizer (AV): translates Gemma 3's internal activations into readable text for any token.
- Activation Reconstructor (AR): verifies faithfulness by checking whether the explanation can be converted back into the original activations (a toy sketch of this check follows the list).
- Available now on Neuronpedia for Gemma 3 27B Instruct: click any token in a response to see what the model was thinking.
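As referenced above, the AR's verification can be framed as a reconstruction-similarity test. The toy sketch below illustrates that framing with random stand-in vectors; the cosine metric, the 0.8 threshold, and the 4096-dimensional size are illustrative assumptions, not details of the released system.

```python
# Toy illustration of an AR-style faithfulness check. Metric, threshold,
# and vector size are assumptions chosen for demonstration only.
import torch
import torch.nn.functional as F

def is_faithful(original: torch.Tensor, reconstructed: torch.Tensor,
                threshold: float = 0.8) -> bool:
    """Accept an explanation only if the AR's reconstruction lands close
    to the activation the AV explained."""
    return F.cosine_similarity(original, reconstructed, dim=0).item() >= threshold

original = torch.randn(4096)               # stand-in activation vector
good = original + 0.1 * torch.randn(4096)  # near-perfect reconstruction
bad = torch.randn(4096)                    # unrelated reconstruction

print(is_faithful(original, good))  # True  -> explanation kept
print(is_faithful(original, bad))   # False -> explanation rejected
```

In high-dimensional spaces, unrelated vectors have near-zero cosine similarity, so a faithful reconstruction stands out sharply from a fabricated explanation.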
Why It Matters
Unprecedented token-level transparency into LLM reasoning, enabling deeper debugging, bias detection, and greater trust in model behavior.