Applied Explainability for Large Language Models: A Comparative Study
A new preprint compares three explainability techniques for transformer language models, finding that gradient-based attribution beats attention-based approaches on stability.
A new preprint from researcher Venkata Abhinandan Kancharla provides a practical, head-to-head comparison of three prominent explainability techniques for transformer-based language models. The study, titled "Applied Explainability for Large Language Models: A Comparative Study," evaluates Integrated Gradients, Attention Rollout, and SHAP on a fine-tuned DistilBERT model performing SST-2 sentiment classification. Rather than proposing new methods, the research focuses on assessing the real-world behavior of existing approaches under consistent, reproducible conditions.
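For readers who want to reproduce the setup, the sketch below loads a fine-tuned DistilBERT SST-2 classifier with Hugging Face transformers and scores one sentence. The public checkpoint and the example sentence are stand-ins of our choosing, not artifacts released with the preprint.

```python
# Minimal sketch of the evaluation setup: a fine-tuned DistilBERT classifier
# scoring SST-2-style sentiment. The public checkpoint below is an assumed
# stand-in for the paper's own fine-tuned model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

enc = tokenizer("A remarkably assured, moving piece of filmmaking.",
                return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)[0]
print({model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)})
```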
The findings reveal clear trade-offs. Gradient-based attribution via Integrated Gradients produced the most stable and intuitive explanations, aligning closely with the features that actually drive the model's predictions. Attention-based approaches such as Attention Rollout were computationally efficient but tracked those prediction-relevant features less faithfully. Model-agnostic methods like SHAP offered flexibility across architectures at the cost of higher compute and greater variability in their explanations. Minimal, hedged sketches of each technique follow.
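First, Integrated Gradients. This sketch uses Captum's `LayerIntegratedGradients` on the embedding layer, a common recipe for transformer classifiers; the checkpoint, all-[PAD] baseline, and step count are our assumptions, and the paper's exact configuration may differ.

```python
# Hedged sketch of token-level Integrated Gradients with Captum.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def forward_logits(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("A remarkably assured, moving piece of filmmaking.",
                return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

# Baseline: [PAD] everywhere except the [CLS]/[SEP] positions
# (assumes a single unpadded sequence).
baseline = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline[0, 0] = tokenizer.cls_token_id
baseline[0, -1] = tokenizer.sep_token_id

lig = LayerIntegratedGradients(forward_logits, model.distilbert.embeddings)
attributions = lig.attribute(
    inputs=input_ids,
    baselines=baseline,
    additional_forward_args=(attention_mask,),
    target=1,    # attribute toward the "positive" class
    n_steps=50,  # interpolation steps along the straight-line path
)

# Collapse the embedding dimension to one attribution score per token.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / scores.norm()
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores):
    print(f"{tok:>15s} {s.item():+.3f}")
```

Note that Integrated Gradients attributions are always relative to the chosen baseline; the all-[PAD] baseline is one common choice and is worth reporting alongside any scores.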
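Next, Attention Rollout. This sketch follows the formulation of Abnar and Zuidema (2020): average attention over heads, add the identity matrix for residual connections, renormalize, and multiply the per-layer matrices together. Whether the paper uses this exact variant is an assumption on our part.

```python
# Hedged sketch of Attention Rollout over DistilBERT's attention maps.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def attention_rollout(attentions):
    """attentions: tuple of (batch, heads, seq, seq) tensors, one per layer."""
    rollout = None
    for layer_attn in attentions:
        a = layer_attn.mean(dim=1)           # average over heads
        a = a + torch.eye(a.size(-1))        # identity for residual connections
        a = a / a.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = a if rollout is None else torch.bmm(a, rollout)
    return rollout

enc = tokenizer("A remarkably assured, moving piece of filmmaking.",
                return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_attentions=True)

rollout = attention_rollout(out.attentions)
# Row for [CLS]: how much each token flows into the pooled representation
# that DistilBERT's classification head reads.
cls_flow = rollout[0, 0]
for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), cls_flow):
    print(f"{tok:>15s} {s.item():.3f}")
```

Rollout needs only a single forward pass with `output_attentions=True`, which is consistent with attention-based methods coming out as the cheapest option in the comparison.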
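Finally, SHAP. This sketch uses the shap library's documented support for transformers pipelines, under which it selects a Partition explainer with a text masker; the checkpoint, input, and explainer configuration are again assumptions rather than the paper's settings.

```python
# Hedged sketch of SHAP for text classification via a transformers pipeline.
import shap
from transformers import pipeline

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed stand-in
# return_all_scores matches shap's documented pipeline example;
# newer transformers versions prefer top_k=None.
classifier = pipeline("sentiment-analysis", model=MODEL, return_all_scores=True)

explainer = shap.Explainer(classifier)
shap_values = explainer(["A remarkably assured, moving piece of filmmaking."])

# Token-level contributions toward the POSITIVE class.
print(shap_values[0, :, "POSITIVE"])
```

Even this single-sentence example makes the cost profile visible: the explainer issues many masked forward passes per input, which is the source of the higher computational overhead noted in the findings.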
This work emphasizes that explainability techniques should be treated as diagnostic tools rather than sources of definitive explanations. The comparative framework and practical insights help researchers and engineers make informed choices when implementing transparency features in production NLP systems, balancing computational efficiency against explanation quality and stability.
Key Findings
- Gradient-based attribution (Integrated Gradients) provided the most stable and intuitive explanations for DistilBERT sentiment analysis
- Attention-based methods (Attention Rollout) were computationally efficient but less aligned with prediction-relevant features
- Model-agnostic approaches (SHAP) offered flexibility but with higher computational cost and variability
Why It Matters
Provides practical guidance for engineers implementing explainable AI in production systems, helping balance transparency with performance.