[D] VIT16 - Should I use all or only final attention MHA to generate attention heatmap?
A researcher's experiment challenges standard practice for visualizing AI vision models.
A researcher testing Vision Transformer (ViT) models found that averaging attention scores across all Multi-Head Attention (MHA) layers produced "better" heatmaps than using only the final layer, contradicting the common practice of visualizing the last layer alone. The experiment also revealed consistently high attention on image padding, which is hard to explain. This discrepancy raises fundamental questions about the reliability of attention heatmaps, a key tool for interpreting what these vision models are actually "looking at" when they make predictions.
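To make the two approaches concrete, here is a minimal numpy sketch of final-layer versus all-layer heatmaps, plus attention rollout (Abnar & Zuidema, 2020), which is a common middle ground. The attention tensors here are synthetic stand-ins for a real model's outputs, and the shapes assume ViT-B/16 geometry (12 layers, 12 heads, 197 tokens: 1 [CLS] + 14x14 patches); the function names are illustrative, not from any library.

```python
import numpy as np

LAYERS, HEADS, TOKENS, GRID = 12, 12, 197, 14  # ViT-B/16-style geometry

def normalize_rows(a):
    """Scale each attention row so it sums to 1."""
    return a / a.sum(axis=-1, keepdims=True)

def final_layer_heatmap(attn):
    """Head-averaged [CLS]->patch attention from the last layer only."""
    return attn[-1].mean(axis=0)[0, 1:].reshape(GRID, GRID)

def all_layers_heatmap(attn):
    """[CLS]->patch attention averaged over every layer and head."""
    return attn.mean(axis=(0, 1))[0, 1:].reshape(GRID, GRID)

def rollout_heatmap(attn):
    """Attention rollout: multiply head-averaged attention matrices
    through the layers, folding in the residual connection as an
    identity term, to track how information actually propagates."""
    joint = np.eye(TOKENS)
    for layer in attn:
        a = normalize_rows(layer.mean(axis=0) + np.eye(TOKENS))
        joint = a @ joint
    return joint[0, 1:].reshape(GRID, GRID)

# Synthetic row-stochastic attention standing in for real model outputs.
rng = np.random.default_rng(0)
attn = normalize_rows(rng.random((LAYERS, HEADS, TOKENS, TOKENS)))

final_map = final_layer_heatmap(attn)    # last-layer view
avg_map = all_layers_heatmap(attn)       # all-layer average
rolled_map = rollout_heatmap(attn)       # rollout view
```

The trade-off the experiment surfaces: the final layer captures the most task-specific attention but can be noisy, a plain average over all layers mixes low-level positional patterns with semantic ones, and rollout attempts to account for how attention composes through the residual stream rather than treating each layer independently.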
Why It Matters
If standard visualization methods are flawed, it undermines trust in our ability to understand and debug critical AI vision systems.