AI Safety

Exemplar Partitioning cracks AI interpretability with 1000x less compute

LessWrong AI May 16, 2026

⚡ Forget SAEs: one hyperparameter, no gradient descent, and within 0.03 AUROC of the strongest SAE baseline.

Deep Dive

Most current feature discovery in language models relies on sparse autoencoders (SAEs): powerful but computationally heavy models trained by gradient descent on a combined reconstruction and sparsity loss. Jessica Rumbelow's Exemplar Partitioning (EP) takes a different approach: it builds a Voronoi partition of activation space using only leader clustering, with a distance threshold as its single hyperparameter. The method makes one streaming pass over activation tokens, with no backward passes or gradient descent, and produces a dictionary of exemplar activations, each anchoring one region. Each region can then be interpreted via logit-lens decodes and its member prompts.
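The single-pass procedure described above can be sketched in a few lines. This is a minimal illustration of generic leader clustering, not the author's implementation: the function names `leader_cluster` and `nearest_exemplar`, the use of Euclidean distance, and the toy 2-D points are all assumptions made for the example.

```python
import math

def leader_cluster(activations, threshold):
    """One streaming pass of leader clustering (illustrative sketch).

    activations: iterable of equal-length numeric sequences, one per token.
    threshold:   the single hyperparameter. Euclidean distance is an
                 assumption here; the source does not name the metric.
    Returns (exemplars, labels), where labels[i] is the index of the
    exemplar that point i joined (or became) during the pass.
    """
    exemplars = []
    labels = []
    for x in activations:
        x = tuple(float(v) for v in x)
        # Find the closest exemplar seen so far.
        best, best_dist = None, math.inf
        for i, e in enumerate(exemplars):
            d = math.dist(x, e)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            labels.append(best)              # close enough: join that region
        else:
            exemplars.append(x)              # too far from everything: new exemplar
            labels.append(len(exemplars) - 1)
    return exemplars, labels

def nearest_exemplar(exemplars, x):
    """Voronoi cell of x: the index of its nearest exemplar."""
    return min(range(len(exemplars)), key=lambda i: math.dist(x, exemplars[i]))

# Toy 2-D "activations" standing in for real residual-stream vectors.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (0.0, 0.3)]
exemplars, labels = leader_cluster(points, threshold=1.0)
cell = nearest_exemplar(exemplars, (4.0, 4.0))
```

Note that the label a point receives during the pass can differ from its final Voronoi cell, because an exemplar created later may turn out to be nearer. Reassigning every point with `nearest_exemplar` once the dictionary is fixed gives the actual partition.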

On the AxBench latent concept-detection benchmark, using Gemma-2-2B-it at layer 20, EP at p₁ achieves 0.881 mean AUROC across 500 concepts, within 0.03 of the strongest SAE baseline (SAE-A). The savings in build cost are large: EP used only 3.6 million activation tokens, while the canonical GemmaScope 16k SAE was trained on ~4 billion tokens with ~10⁶ optimizer steps. That simplicity suits EP to rapid prototyping, causal intervention studies, and tracking how representations change across layers and inputs, without the overhead of training a full autoencoder.
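As a rough check on the "1000x" headline figure, the two token counts quoted above can be divided directly. Treating tokens processed as a proxy for build compute is an assumption: the per-token cost of a clustering pass and of an SAE training step are not the same.

```latex
\frac{4 \times 10^{9}\ \text{tokens (GemmaScope 16k SAE)}}
     {3.6 \times 10^{6}\ \text{tokens (EP)}}
\approx 1.1 \times 10^{3}
```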

Key Points

EP uses leader-clustering with one hyperparameter (distance threshold) to partition activation space into Voronoi cells.
Achieves 0.881 mean AUROC on AxBench, only 0.03 below SAE-A, using roughly 1000x fewer build tokens (3.6M vs ~4B).
No gradient descent or backprop needed; builds dictionary in a single streaming pass, enabling fast interpretability experiments.

Why It Matters

Democratizes mechanistic interpretability by drastically reducing compute requirements for feature discovery in LLMs.
