Pando: A Controlled Benchmark for Interpretability Methods
New benchmark with 720+ fine-tuned LLMs shows gradient methods outperform blackbox approaches in decoding model decisions.
A research team from Carnegie Mellon University and Meta has introduced Pando, a benchmark for evaluating AI interpretability methods. It comprises 720+ fine-tuned language models, each of whose decision-making follows a known decision-tree structure, creating a controlled environment with clear ground truth. Crucially, the models were trained under three distinct conditions: with faithful rationales, with no rationales (just yes/no labels), or with deliberately unfaithful rationales. This design lets researchers test interpretability methods in the safety-critical scenarios where blackbox approaches typically fail.
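To make the setup concrete, here is a minimal sketch of what a decision-tree-governed task could look like. The `Node` class, feature names, and rationale format are illustrative assumptions for exposition, not the paper's actual data format.

```python
# A minimal, illustrative sketch of a decision-tree-governed task.
# The Node class, feature names, and rationale format are assumptions;
# Pando's actual data format may differ.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Node:
    feature: str | None = None   # None marks a leaf
    yes: Node | None = None      # branch taken when the feature is True
    no: Node | None = None       # branch taken when the feature is False
    label: str | None = None     # "yes"/"no" decision at a leaf

def decide(tree: Node, facts: dict[str, bool]) -> tuple[str, list[str]]:
    """Walk the tree; return the ground-truth label and the faithful
    rationale (the sequence of feature checks along the path)."""
    node, path = tree, []
    while node.feature is not None:
        taken = facts[node.feature]
        path.append(f"{node.feature}={taken}")
        node = node.yes if taken else node.no
    return node.label, path

# Toy tree: approve iff employed and (low_debt or has_collateral).
tree = Node("employed",
            yes=Node("low_debt",
                     yes=Node(label="yes"),
                     no=Node("has_collateral",
                             yes=Node(label="yes"),
                             no=Node(label="no"))),
            no=Node(label="no"))

label, rationale = decide(tree, {"employed": True, "low_debt": False,
                                 "has_collateral": True})
print(label, rationale)
# yes ['employed=True', 'low_debt=False', 'has_collateral=True']
# Faithful condition: train on (prompt, rationale, label).
# No-rationale condition: train on (prompt, label) alone.
# Unfaithful condition: pair the correct label with a misleading rationale.
```

With a known tree, an interpretability method's output can be scored directly against the features the model was trained to consult, which is what supplies the benchmark's ground truth.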
In their comprehensive evaluation, the researchers found that gradient-based methods, particularly those using gradient attribution norms such as RelP, consistently outperformed blackbox baselines, successfully decoding model decisions even when rationales were absent or misleading. Non-gradient interpretability methods, including five variants of sparse autoencoders (SAEs), struggled to surpass blackbox performance despite extensive tuning. To ensure a fair comparison, the team implemented strict controls: input sets were fixed, and confounding factors such as scenario names and prompt formatting that could skew results were eliminated.
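The core idea behind the gradient-based family can be sketched as follows: score each input token by the norm of the gradient of the answer logit with respect to that token's embedding. The sketch below is a generic gradient-norm attribution for a Hugging Face causal LM; it is not the paper's RelP implementation, and the model name, prompt, and target token are placeholder assumptions.

```python
# Generic gradient-norm attribution for a causal LM. This sketches the
# gradient-based idea in spirit only; it is NOT the paper's RelP method.
# The model name, prompt, and target token are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for one of the fine-tuned Pando models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Applicant is employed and has collateral. Approve? Answer:"
enc = tok(prompt, return_tensors="pt")

# Differentiate w.r.t. the input embeddings rather than discrete token ids.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits

# Score: logit of the answer token (" yes") at the final position.
target_id = tok(" yes", add_special_tokens=False)["input_ids"][0]
logits[0, -1, target_id].backward()

# Attribution per input token: L2 norm of its embedding gradient.
scores = embeds.grad[0].norm(dim=-1)
for token, s in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), scores):
    print(f"{token:>15}  {s.item():.4f}")
```

Under the benchmark's framing, high-scoring tokens should line up with the decision-tree features the model actually consults, which is why this family can decode decisions even when the emitted rationale is absent or misleading.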
The findings reveal significant challenges in the interpretability field, demonstrating how easy it is to fall into 'interpretability illusions' where methods appear effective on test examples but fail under rigorous evaluation. Pando represents the first benchmark designed specifically to compare interpretability methods against each other and against blackbox approaches using identical inputs and conditions. The researchers are considering moving to a budget-based evaluation system in future iterations, though designing effective prompt selection strategies remains an open challenge.
- Pando benchmark includes 720+ fine-tuned LLMs with known decision trees and three training conditions (faithful, absent, or unfaithful rationales)
- Gradient-based methods (like RelP attribution) outperformed blackbox baselines, while non-gradient methods (including five SAE variants) consistently struggled
- First apples-to-apples comparison fixing input sets and eliminating confounders like scenario names and prompt formatting
Why It Matters
Provides rigorous testing ground for AI safety tools, revealing which interpretability methods actually work when models can't explain themselves.