Research & Papers

Elastic Attention Cores Let Vision Transformers Scale to 1024px with 90% Less Compute

A new block-sparse attention mechanism rivals DINOv3 accuracy while slashing quadratic cost.

Deep Dive

Vision Transformers (ViTs) have traditionally relied on dense self-attention, whose cost scales quadratically with the number of image patches and therefore makes high-resolution tasks computationally prohibitive. To address this, researchers introduce Elastic Attention Cores, a backbone built on a core-periphery block-sparse attention structure. Instead of scoring all N² pairs among N patch tokens, the model uses a small set of C 'core' tokens that attend to the full periphery, while periphery tokens attend only to the cores. This cuts the attention cost from N² to 2NC + C² token pairs: NC for cores reading the periphery, NC for the periphery reading the cores, and C² for cores attending to each other. With C fixed, the cost grows linearly rather than quadratically in N. The approach is validated across resolutions from 256 to 1024 pixels, where it matches or exceeds the performance of state-of-the-art models like DINOv3.
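As a rough illustration of the 2NC + C² count, the sketch below tallies query-key pairs for dense and core-periphery attention, and checks the formula against an explicit mask. The 16-pixel patch size, the choice of 64 cores, and all function names are assumptions made for illustration, not details from the paper. It counts attention score pairs only, not total model compute.

```python
def num_patches(image_px, patch_px=16):
    """Number of patch tokens N for a square image (patch size is an assumption)."""
    side = image_px // patch_px
    return side * side


def dense_pairs(n):
    """Query-key pairs scored by dense self-attention over n tokens."""
    return n * n


def core_periphery_pairs(n, c):
    """Pairs scored when c cores and n periphery tokens interact only via cores."""
    return 2 * n * c + c * c


def core_periphery_mask(n, c):
    """Boolean mask over (c + n) tokens; the first c indices are the cores.

    A pair is allowed when the query or the key is a core token, so
    periphery-to-periphery pairs are the only ones masked out.
    """
    total = c + n
    return [[q < c or k < c for k in range(total)] for q in range(total)]


def count_allowed(mask):
    """Count the True entries in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    cores = 64  # hypothetical core count
    for px in (256, 512, 1024):
        n = num_patches(px)
        ratio = core_periphery_pairs(n, cores) / dense_pairs(n)
        print(f"{px}px: N={n}, sparse/dense pair ratio={ratio:.3f}")
```

Under these assumed settings the ratio shrinks as resolution grows, because the sparse count is linear in N while the dense count is quadratic. At low resolution with many cores the sparse count can exceed the dense one, so the benefit depends on keeping C well below N.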

A key innovation is the use of nested dropout during training, which allows the number of core tokens to be adjusted at test time without retraining. Using fewer cores produces wider, more diffuse attention; using more sharpens focus. The attention maps also exhibit emergent behavior: early layers show isotropic patterns, while deeper layers become semantically aligned. The model is stable across resolutions and can be used for both classification and dense prediction tasks. Code is available on GitHub. This opens the door to efficient high-resolution vision models with flexible compute budgets.
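The nested dropout idea can be sketched in a few lines: at each training step a core count k is sampled and only the first k core tokens are kept, so every prefix of the core set learns to work on its own. This is a minimal sketch of the general technique, not the authors' implementation. The uniform sampling distribution and the names `sample_core_count` and `keep_prefix` are assumptions; the source does not say which distribution the paper uses.

```python
import random


def sample_core_count(rng, max_cores, min_cores=1):
    """Sample how many core tokens to keep for one training step.

    Uniform sampling is an assumption made for this sketch.
    """
    return rng.randint(min_cores, max_cores)  # inclusive on both ends


def keep_prefix(core_tokens, k):
    """Keep the first k core tokens and drop the rest (nested truncation)."""
    if not 1 <= k <= len(core_tokens):
        raise ValueError("k must be between 1 and the number of core tokens")
    return core_tokens[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    cores = [f"core_{i}" for i in range(8)]  # stand-ins for learned core tokens

    # Training: a different prefix length is used at each step.
    for step in range(3):
        k = sample_core_count(rng, len(cores))
        print(f"step {step}: training with {len(keep_prefix(cores, k))} cores")

    # Inference: pick any budget without retraining.
    print("low-compute inference uses:", keep_prefix(cores, 2))
```

Because the kept cores are always a prefix, a smaller budget at inference is a subset of a larger one, which is what makes the core count adjustable after training.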

Key Points
  • Sparse attention scales as (2NC + C²) instead of N² for N patches and C core tokens, enabling resolutions up to 1024px.
  • Nested dropout allows elastic adjustment of core tokens at inference, trading compute for attention granularity.
  • Matches DINOv3 accuracy on classification and dense tasks, with emergent semantic attention in deeper layers.

Why It Matters

Enables high-resolution vision models for medical imaging and autonomous driving with adjustable compute cost.