AdaMerge speeds up Vision Transformers with smarter token merging
New training-free method cuts accuracy loss by 30% at high compression rates
The quadratic cost of self-attention in Vision Transformers (ViTs) is a major bottleneck for deployment, especially on resource-constrained devices. Existing training-free token reduction methods like ToMe treat all tokens equally, but self-attention is non-uniform: some tokens matter far more than others. AdaMerge addresses this with two key innovations. The first is salience-weighted similarity, which uses column-wise feature affinity to weight tokens by importance during merging. The second is adaptive merging intensity, which dynamically adjusts how many tokens each layer keeps based on similarity statistics. Together, these preserve high-salience tokens and reduce information loss even under aggressive compression.
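The two ideas above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the softmax-based salience, the way salience discounts cosine similarity, the ToMe-style even/odd split, and the names `salience`, `merge_tokens`, `r_max`, and `tau` are all assumptions made for the example. In particular, the fixed threshold `tau` is a stand-in for whatever similarity statistics the paper actually uses.

```python
import numpy as np

def salience(x):
    """Assumed salience: column sums of a row-softmax feature affinity,
    i.e. how much every other token 'attends' to each token."""
    d = x.shape[1]
    logits = x @ x.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)      # row-wise softmax
    s = affinity.sum(axis=0)                             # column-wise sum
    return s / s.sum()

def merge_tokens(x, r_max, tau=0.8):
    """Merge at most r_max tokens of x (shape (N, D), N >= 2).

    Hypothetical scheme: a source token is merged only if its
    salience-discounted similarity to its best match exceeds tau,
    so the number of merges adapts to how redundant the tokens are.
    """
    n = x.shape[0]
    s = salience(x)
    src = np.arange(0, n, 2)          # candidate tokens to merge away
    dst = np.arange(1, n, 2)          # tokens that absorb merged sources
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit[src] @ unit[dst].T     # cosine similarity, sources vs. destinations
    best = sim.argmax(axis=1)         # best destination for each source
    # Salient sources get a lower score, so they are merged last (or not at all).
    score = sim.max(axis=1) * (1.0 - s[src] / s.max())
    r = min(r_max, int((score > tau).sum()))   # adaptive merge count
    order = np.argsort(-score)                 # highest score first
    weight = s.copy()
    total = x * weight[:, None]                # salience-weighted features
    for i in order[:r]:
        a, b = src[i], dst[best[i]]
        total[b] += total[a]
        weight[b] += weight[a]
    keep = np.sort(np.concatenate([src[order[r:]], dst]))
    return total[keep] / weight[keep, None]    # salience-weighted average

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
print(merge_tokens(tokens, r_max=4).shape)
```

Setting `tau` above 1 disables merging entirely, while a very low `tau` always merges `r_max` tokens, which mimics a fixed per-layer schedule. Intermediate values let layers with many near-duplicate tokens merge more and layers with distinct tokens merge less.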
On ImageNet-1k with ViT-B/16, AdaMerge consistently beats ToMe, PiToMe, and DSM across all FLOPs-matched operating points. At 13.4G FLOPs (high compression), AdaMerge's Top-1 accuracy falls by only 1.06%, compared with 1.45% for PiToMe and 4.62% for DSM. The gap widens as compression increases, suggesting that AdaMerge handles redundancy better than the alternatives. Because it is training-free, it can be dropped into any ViT inference pipeline with no fine-tuning, which suits real-time applications like autonomous driving, robotics, and mobile vision. The paper has been submitted to NeurIPS 2026.
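To put those accuracy drops in relative terms, the short calculation below uses only the three figures quoted above for the 13.4G FLOPs operating point. The helper name `relative_reduction` is made up for this example.

```python
def relative_reduction(baseline_drop, new_drop):
    """Fraction by which new_drop is smaller than baseline_drop."""
    return (baseline_drop - new_drop) / baseline_drop

adamerge_drop = 1.06                      # Top-1 drop (%) at 13.4G FLOPs
baselines = {"PiToMe": 1.45, "DSM": 4.62}

for name, drop in baselines.items():
    share = relative_reduction(drop, adamerge_drop)
    print(f"vs {name}: accuracy loss is {share:.1%} smaller")
# vs PiToMe: accuracy loss is 26.9% smaller
# vs DSM: accuracy loss is 77.1% smaller
```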
- Combines salience-weighted similarity and adaptive per-layer reduction in a single training-free framework
- At 13.4G FLOPs on ImageNet-1k, Top-1 accuracy falls by only 1.06%, versus 4.62% for DSM
- Consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes
Why It Matters
AdaMerge enables faster ViT deployment without retraining, which is critical for edge devices and real-time vision applications.