[R] Concept Influence: Training Data Attribution via Interpretability (Same performance as influence functions, 20× faster)
New technique attributes LLM behavior to semantic concepts instead of individual examples, slashing computation time.
FAR.AI researchers have introduced Concept Influence, a new method for training data attribution (TDA) that substantially improves both speed and interpretability. The technique addresses two major limitations of traditional influence functions: their bias toward lexical overlap rather than semantic similarity, and their computational expense at large language model (LLM) scale.
Instead of attributing model outputs to individual training examples via the test-loss gradient ∇_θ L(z_test), Concept Influence attributes behavior to semantic directions represented by interpretable vectors such as probes or sparse autoencoder (SAE) features, via the concept-score gradient ∇_θ f_v^ℓ(x_test). This shift reframes the question from "which data matches this output?" to "which data causes this behavior?" The results are striking: Concept Influence achieves 20× faster computation while maintaining attribution accuracy, and simple probe methods work surprisingly well as first-order approximations.
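To make the gradient swap concrete, here is a minimal first-order sketch (a TracIn-style gradient dot product, not the full inverse-Hessian influence computation). Everything here, including the `flat_grad` and `concept_influence` helpers, the toy MLP, and the random probe direction, is an illustrative assumption rather than the authors' released code:

```python
import torch

def flat_grad(scalar, params):
    """Flatten d(scalar)/d(params) into one vector; zeros for unused params."""
    grads = torch.autograd.grad(scalar, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def concept_influence(model, params, probe_v, acts_fn, x_test, train_set, loss_fn):
    """Score each training example by how much its loss gradient aligns with
    the gradient of the concept score f_v(x_test) = <v, layer-l activations>."""
    concept_score = acts_fn(model, x_test) @ probe_v    # f_v(x_test)
    g_concept = flat_grad(concept_score, params)        # ∇_θ f_v(x_test), test side
    scores = []
    for z in train_set:
        g_train = flat_grad(loss_fn(model, z), params)  # ∇_θ L(z), train side
        scores.append(torch.dot(g_concept, g_train).item())
    return scores  # high score: the example pushes θ toward the concept

# Toy usage: a 2-layer MLP whose "concept" is a random hidden-layer direction.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))
params = list(model.parameters())
probe_v = torch.randn(16)
hidden_acts = lambda m, x: m[1](m[0](x)).squeeze(0)    # activations after Tanh
mse_loss = lambda m, z: ((m(z[0]) - z[1]) ** 2).mean()
x_test = torch.randn(1, 8)
train_set = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(4)]
print(concept_influence(model, params, probe_v, hidden_acts, x_test,
                        train_set, mse_loss))
```

The only change from classic influence scoring is the test-side gradient: swapping the test loss for a probe's concept score is what lets the method ask which data causes a behavior rather than which data resembles an output.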
In practical applications, the method proved effective for AI safety. Applied to the OASST1 dataset, it let researchers identify and remove just 5% of problematic training data, reducing harmful outputs by 3× while preserving model capabilities. SAE clustering revealed the semantic features driving behaviors, with 2000× higher influence on relevant concepts. The approach also outperformed traditional influence functions at detecting emergent misalignment across all datasets tested.
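The data-removal result suggests a simple curation loop: score every training example against a harmfulness probe, drop the highest-scoring fraction, and retrain. Below is a hedged sketch of that filtering step, where `filter_by_concept` and the `drop_frac` parameter are hypothetical names; only the 5% figure comes from the reported experiment:

```python
import numpy as np

def filter_by_concept(scores, dataset, drop_frac=0.05):
    """Drop the drop_frac most concept-influential examples before retraining."""
    scores = np.asarray(scores)
    cutoff = np.quantile(scores, 1.0 - drop_frac)  # score at the keep/drop boundary
    return [z for s, z in zip(scores, dataset) if s < cutoff]

# e.g. with scores from concept_influence(...) against a harmfulness probe:
scores = [0.8, -0.1, 4.2, 0.3, 0.0, 2.9, -0.5, 1.1, 0.2, 0.6]
dataset = list(range(10))                  # stand-in training examples
print(filter_by_concept(scores, dataset))  # drops the top-scoring example(s)
```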
This advancement enables more efficient model auditing, targeted dataset curation, and improved safety interventions. By making TDA both faster and more semantically meaningful, Concept Influence provides a practical tool for understanding and controlling what training data drives specific LLM behaviors.
- 20× faster computation than traditional influence functions while maintaining attribution accuracy
- Reduced harmful outputs by 3× by removing just 5% of OASST1 training data, with model capabilities preserved
- SAE clustering revealed semantic features with 2000× higher influence on relevant concepts
Why It Matters
Enables efficient auditing of training data for safety and alignment, making LLM behavior more transparent and controllable.