Exemplar Partitioning nears SAE-level concept detection with roughly 1,000x fewer tokens
An alternative to SAEs: one hyperparameter, no gradient descent, and within 0.03 AUROC of the strongest baseline.
Most current feature discovery in language models relies on sparse autoencoders (SAEs), which are powerful but computationally heavy: they are trained by gradient descent on a combined reconstruction and sparsity loss. Jessica Rumbelow's Exemplar Partitioning (EP) takes a radically different approach. It builds a Voronoi partition of activation space using only leader clustering, with a single hyperparameter: a distance threshold. The method makes one streaming pass over activation tokens, with no backward passes or gradient descent, and produces a dictionary of exemplar activations, each anchoring one region of the partition. Each region can then be interpreted through logit-lens decodes and the prompts whose activations fall inside it.
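The streaming pass described above can be sketched in a few lines. This is only an illustration of generic leader clustering, not the author's implementation: the function names `leader_cluster` and `assign_region`, the choice of Euclidean distance, and the rule of assigning each point to its nearest existing exemplar are all assumptions that the summary does not specify.

```python
import numpy as np

def leader_cluster(activations, threshold):
    """One streaming pass of leader clustering (illustrative sketch).

    Assumes `activations` is a non-empty array of shape (n_tokens, d)
    and that distance is Euclidean; both are assumptions, not details
    taken from the EP paper.
    """
    exemplars = []    # exemplar activations found so far
    assignments = []  # region index for every activation, in stream order
    for x in activations:
        if exemplars:
            # Distance from x to every exemplar found so far.
            dists = np.linalg.norm(np.stack(exemplars) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                # x lies within the threshold of an existing exemplar.
                assignments.append(nearest)
                continue
        # No exemplar is close enough: x becomes a new exemplar.
        exemplars.append(np.asarray(x, dtype=float))
        assignments.append(len(exemplars) - 1)
    return np.stack(exemplars), np.array(assignments)

def assign_region(x, exemplars):
    """Voronoi lookup: index of the exemplar nearest to activation x."""
    return int(np.argmin(np.linalg.norm(exemplars - x, axis=1)))
```

Calling `leader_cluster(acts, threshold=1.0)` on a small array of 2-D points returns the exemplar dictionary and a region index per point; `assign_region` then places any new activation in the cell of its nearest exemplar. The threshold is the only knob: a smaller value yields more, finer regions, and no gradients or optimizer state are involved at any point.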
On the AxBench latent concept-detection benchmark, using layer 20 of Gemma-2-2B-it, EP at p₁ achieves 0.881 mean AUROC across 500 concepts, within 0.03 of the strongest SAE baseline (SAE-A). The savings in build cost are large: EP used only 3.6 million activation tokens, while the canonical GemmaScope 16k SAE was trained on ~4 billion tokens with ~10⁶ optimizer steps, a gap of more than 1,000x in tokens alone. EP's simplicity suits it to rapid prototyping, causal intervention studies, and tracking how representations change across layers and inputs, without the overhead of training a full autoencoder.
- EP uses leader-clustering with one hyperparameter (distance threshold) to partition activation space into Voronoi cells.
- Achieves 0.881 mean AUROC on AxBench, only 0.03 below SAE-A, while building its dictionary from roughly 1,000x fewer tokens (3.6M vs ~4B).
- No gradient descent or backprop needed; builds dictionary in a single streaming pass, enabling fast interpretability experiments.
Why It Matters
EP could make mechanistic interpretability more widely accessible by sharply reducing the compute needed for feature discovery in LLMs.