Research & Papers

Belief2-Attention boosts vision models with dual-component attention

New mechanism uses both perpendicular and projected signals for richer token correlation.

Deep Dive

In a new arXiv preprint, researcher Guoqiang Zhang introduces Belief2-Attention, a refined attention mechanism for vision tasks that builds on the earlier Belief-Attention framework. The original Belief-Attention performed an orthogonal projection of the softmax-weighted summation of V vectors onto the original V vectors, using the perpendicular component as a residual signal. Zhang's ablation study reveals that the projected component also carries significant token correlation information that was previously discarded.
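The decomposition behind the original Belief-Attention can be sketched in a few lines of plain Python. This is only an illustration, not the paper's code: it assumes each token's softmax-weighted sum of V vectors is projected onto that token's own V vector, and that scores use the usual 1/sqrt(d) scaling. The function name `decompose_attention_output` and the toy inputs are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def decompose_attention_output(Q, K, V):
    """Split each token's attention output into a part parallel to its own
    V vector (projected) and a part orthogonal to it (perpendicular).
    Assumption: the projection target for token i is V[i]."""
    d = len(Q[0])
    projected, perpendicular = [], []
    for i, q in enumerate(Q):
        weights = softmax([dot(q, k) / math.sqrt(d) for k in K])
        # Softmax-weighted sum of the V vectors for token i.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        v_i = V[i]
        scale = dot(out, v_i) / dot(v_i, v_i)  # assumes V[i] is nonzero
        proj = [scale * x for x in v_i]
        perp = [o - p for o, p in zip(out, proj)]
        projected.append(proj)
        perpendicular.append(perp)
    return projected, perpendicular

# Toy inputs: 3 tokens, dimension 2 (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
projected, perpendicular = decompose_attention_output(Q, K, V)
```

Under these assumptions, the original Belief-Attention would keep only `perpendicular` as its residual signal, while `projected` is the part that the ablation study found to still carry token correlation information.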

Belief2-Attention addresses this by utilizing both components. The projected component is processed through an activation function and a linear mapping before being merged back into the token representation. This effectively turns the projected pathway into a two-layer feedforward network embedded within the attention block itself. Furthermore, the mechanism introduces an additional inner-product matrix ZZ^T alongside the standard QK^T to capture richer pairwise token relationships. Zhang demonstrates mathematically that Belief2-Attention is more expressive than standard attention.
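A rough sketch of how the two pathways might fit together is given below. The summary does not spell out the exact formulation, so several choices here are assumptions: Z is treated as an extra per-token vector, the ZZ^T and QK^T scores are summed before the softmax, ReLU stands in for the unspecified activation, and the two components are merged by addition. The names `belief2_style_attention` and `W_out` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def relu(x):
    return [max(0.0, c) for c in x]

def belief2_style_attention(Q, K, Z, V, W_out):
    """Hypothetical sketch: scores combine QK^T and ZZ^T, and both the
    perpendicular and the projected components of the output are kept."""
    d = len(Q[0])
    outputs = []
    for i in range(len(Q)):
        # Assumption: the two inner-product matrices are summed, then scaled.
        scores = [(dot(Q[i], K[j]) + dot(Z[i], Z[j])) / math.sqrt(d)
                  for j in range(len(K))]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        v_i = V[i]
        scale = dot(out, v_i) / dot(v_i, v_i)  # assumes V[i] is nonzero
        proj = [scale * x for x in v_i]
        perp = [o - p for o, p in zip(out, proj)]
        # Projected pathway: activation followed by a linear mapping.
        mapped = matvec(W_out, relu(proj))
        # Assumption: the two pathways are merged by simple addition.
        outputs.append([a + b for a, b in zip(perp, mapped)])
    return outputs

# Toy inputs: 3 tokens, dimension 2 (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Z = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
V = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
W_out = [[1.0, 0.0], [0.0, 1.0]]  # identity mapping, for illustration only
outputs = belief2_style_attention(Q, K, Z, V, W_out)
```

Since V is itself a linear map of the token input, the chain of V projection, activation, and `W_out` has the shape of a two-layer feedforward network, which is the sense in which the projected pathway acts like an FFN inside the attention block. Setting `W_out` to all zeros in this sketch leaves only the perpendicular residual.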

The proposed method was empirically validated on image classification and segmentation benchmarks, showing consistent improvements over both standard attention and the original Belief-Attention. While specific performance numbers are not detailed in the abstract, the paper claims effectiveness across these core vision tasks. This work points toward a more complete utilization of attention outputs, potentially reducing information loss in transformer-based vision models.

Key Points
  • Belief2-Attention uses both perpendicular and projected components from softmax-weighted V vectors, not just the residual signal.
  • The projected component is processed via an activation function and linear mapping, forming a two-layer FFN inside the attention block.
  • A new inner-product matrix ZZ^T is used alongside QK^T to capture richer token correlations; the method was evaluated on image classification and segmentation.

Why It Matters

More expressive attention could translate into higher accuracy in vision applications such as autonomous driving and medical imaging, though the abstract does not report specific performance numbers.