Research & Papers

Is Attention sink without Positional Encoding unavoidable? [D]

Removing positional encoding causes every query to attend to the same key tokens.

Deep Dive

A machine learning practitioner experimenting with small Transformers (encoder-decoder and cross-attention-only memory models) observed a persistent attention sink whenever Positional Encoding (PE) is removed. With both self-attention and cross-attention, the attention heatmaps show vertical hot lines: every query vector attends to the same few key tokens. Adding RoPE or another PE introduces diagonal patterns, but the user expected cross-attention not to need PE, since queries and keys represent different data. Regularizing the model to spread attention only widens the vertical stripes and fails to produce query-dependent attention.
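
Below is a minimal, illustrative sketch (not the poster's code; the shapes and variable names are assumptions) of the kind of diagnostic that reveals such a sink: compute attention weights with no PE anywhere and check whether a few key columns absorb most of the attention from every query.

```python
# Diagnostic sketch: detect "vertical stripes" (attention sink) in a
# no-PE attention map. Random tensors stand in for learned projections.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_queries, n_keys, d = 16, 32, 64
Q = torch.randn(n_queries, d)   # stand-in for learned query projections
K = torch.randn(n_keys, d)      # stand-in for learned key projections

# Scaled dot-product attention with no positional encoding anywhere.
attn = F.softmax(Q @ K.T / d ** 0.5, dim=-1)   # (n_queries, n_keys)

# Sink diagnostic: mean attention each key receives across all queries,
# and how little it varies per query. Keys with high mean and low std
# correspond to the vertical hot lines in the heatmap.
col_mean = attn.mean(dim=0)
col_std = attn.std(dim=0)
top = col_mean.topk(3)
print("keys attracting most attention:", top.indices.tolist())
print("mean attention they receive:", top.values.tolist())
print("std across queries:", col_std[top.indices].tolist())
```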

This phenomenon, known as "attention sink," is well documented in the LLM literature (e.g., Xiao et al., 2023, "Efficient Streaming Language Models with Attention Sinks"). Without PE, the model defaults to attending to a fixed set of tokens (often the first token): the softmax must place its probability mass somewhere, and a consistent, content-independent target is the most stable place to park it. Forcing dynamic attention without PE may therefore require additional architectural changes, such as query-key bias terms, gating mechanisms, or masking tricks. For practitioners, this underscores that PE is not just a convenience but a critical enabler of positional sensitivity in attention mechanisms.
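
As a hedged illustration of the most common remedy, here is a minimal sketch of applying RoPE (in the "rotate-half" formulation) to queries and keys before the dot product; the function and dimensions are assumptions for demonstration, not the poster's setup.

```python
# Sketch: rotary positional embedding applied to Q and K so that the
# attention score between positions i and j depends on (i - j).
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    half = d // 2
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = pos * freqs                                                 # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# With RoPE, each query/key is rotated by an angle tied to its position,
# so attention can develop the diagonal, query-dependent structure the
# poster expected but did not see without PE.
Q = apply_rope(torch.randn(16, 64))
K = apply_rope(torch.randn(16, 64))
scores = Q @ K.T / 64 ** 0.5
```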

Key Points
  • Removing Positional Encoding from self/cross-attention produces vertical lines in attention heatmaps, meaning all queries attend to the same keys.
  • Even with regularization, attention only spreads more evenly over the same keys; no diagonal (query-dependent) patterns emerge (see the regularizer sketch after this list).
  • The user tried both encoder-decoder and cross-attention-only models, confirming the issue is architecture-agnostic.
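
The regularization referenced above can be sketched as a generic entropy penalty on the attention rows (an assumption, not the poster's exact loss). It shows why spreading attention does not by itself make it query-dependent: nothing in the penalty couples the distribution to query position, so the stripes only widen.

```python
# Sketch: entropy regularizer that spreads each query's attention over
# more keys, without introducing any positional dependence.
import torch
import torch.nn.functional as F

def attention_entropy_penalty(attn: torch.Tensor) -> torch.Tensor:
    """Negative mean row entropy; adding this to the training loss pushes
    each query to distribute attention over more keys."""
    entropy = -(attn * (attn + 1e-9).log()).sum(dim=-1)   # (n_queries,)
    return -entropy.mean()

# Usage inside a training step (scores would come from the model):
scores = torch.randn(16, 32, requires_grad=True)
attn = F.softmax(scores, dim=-1)
loss = attention_entropy_penalty(attn)   # + task loss in a real run
loss.backward()
```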

Why It Matters

Reinforces that Positional Encoding is essential for dynamic attention, even in cross-attention settings.