Disillusionment with mechanistic interpretability research [D]
A new technique 'confabulates' explanations, raising concerns about the field's direction.
In a viral Reddit post, an undergraduate computer scientist voices growing disillusionment with mechanistic interpretability (mech interp) research, taking aim specifically at Anthropic's latest work. The poster describes being swept up in the mech interp wave circa 2024, embracing methods like sparse autoencoders and attribution graphs. However, Anthropic's recent blogpost on 'natural language autoencoders' has shaken their confidence. The technique trains one LLM to compress model activations into a natural language description and a second LLM to reconstruct the activations from that description.
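In rough outline, the pipeline looks like the sketch below. The verbalizer and reconstructor here are placeholder functions rather than Anthropic's trained models, and the names and shapes are illustrative; the point is only the shape of the loop: activation in, natural-language bottleneck, activation back out.

```python
import numpy as np

# Placeholder stand-ins for the two LLMs in a natural language autoencoder.
# Anthropic's actual verbalizer/reconstructor are trained models; these toy
# functions only illustrate the activation -> text -> activation loop.

def verbalize(activation: np.ndarray) -> str:
    """'Verbalizer': compress an activation vector into a text description."""
    top = np.argsort(-np.abs(activation))[:3]
    return "strongest dims: " + ", ".join(str(i) for i in top)

def reconstruct(description: str, dim: int) -> np.ndarray:
    """'Reconstructor': rebuild an activation vector from the text alone."""
    recon = np.zeros(dim)
    for tok in description.removeprefix("strongest dims: ").split(", "):
        recon[int(tok)] = 1.0
    return recon

activation = np.random.randn(512)               # model activation to explain
description = verbalize(activation)             # natural language bottleneck
reconstruction = reconstruct(description, 512)  # rebuilt from text only
```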
The post raises several technical concerns. First, the approach is itself a black box: the 'activation verbalizer' operates without transparency, making it difficult to trust its explanations of model internals. Notably, Anthropic did not compare basic metrics such as fraction of variance explained (FVE) or reconstruction error against established SAE baselines. Most troubling are 'confabulations': cases where the verbalizer fabricates an explanation, with no way at test time to tell whether a given explanation is genuine or confabulated. The author acknowledges that the blogpost discusses these issues and that the method achieves decent results on a misaligned-model auditing benchmark, but questions the value of such benchmarks given their skepticism about AI x-risk arguments.
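For context, FVE is the standard reconstruction metric reported for SAEs; a minimal sketch of how it would be computed over a batch of activations and their reconstructions (function name and shapes are illustrative):

```python
import numpy as np

def fraction_of_variance_explained(acts: np.ndarray, recon: np.ndarray) -> float:
    """FVE over a batch of activation vectors, shape (n_samples, d_model).

    1.0 means perfect reconstruction; 0.0 means no better than always
    predicting the mean activation.
    """
    residual = np.sum((acts - recon) ** 2)
    total = np.sum((acts - acts.mean(axis=0)) ** 2)
    return 1.0 - residual / total
```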
The broader critique is that Anthropic seems less interested in interpretability for its own sake and more in scalable alignment/oversight as a means to solve the 'control problem.' Given the field's heavy reliance on Anthropic's direction, the post worries this trend will pull mech interp away from genuinely understanding models toward shallow, auditable proxies. The author later edited the post to thank respondents who helped change their perspective, but the initial concerns remain a bellwether for tensions in the community.
- Anthropic's natural language autoencoders use a black-box verbalizer that can 'confabulate' explanations, undermining interpretability goals.
- The blogpost omits comparisons of basic metrics such as FVE and reconstruction error against established SAE baselines.
- The author fears Anthropic's prioritization of scalable alignment over true interpretability will misdirect the entire field.
Why It Matters
If mech interp shifts to black-box methods, it risks losing rigor and failing to truly understand AI models—just as regulators demand transparency.