AdaMerge speeds up Vision Transformers with smarter token merging

arXiv cs.CV May 28, 2026

⚡ New training-free method cuts accuracy loss by 30% at high compression rates

Deep Dive

The quadratic cost of self-attention in Vision Transformers (ViTs) is a major bottleneck for deployment, especially on resource-constrained devices. Existing training-free token reduction methods like ToMe treat all tokens equally, but self-attention is non-uniform: some tokens matter far more than others. AdaMerge addresses this with two key components. The first is salience-weighted similarity, which uses column-wise feature affinity to weight tokens by importance during merging. The second is adaptive merging intensity, which adjusts the number of tokens kept in each layer based on similarity statistics. Together they preserve high-salience tokens and reduce information loss even under aggressive compression.
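The two components described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's algorithm: the article gives no formulas, so the salience definition (column mean of a row-softmaxed affinity matrix), the way salience damps the similarity score, and the threshold rule that picks the per-layer merge count are all assumptions. The alternating bipartite split is borrowed from ToMe-style matching, and the names `merge_tokens`, `r_max`, and `threshold` are hypothetical.

```python
import numpy as np

def merge_tokens(x, r_max, threshold=0.5):
    """Merge up to r_max token pairs in x (shape (n, d), n >= 2).

    Returns (tokens, r): the reduced token matrix and the number of merges.
    """
    n, d = x.shape

    # Row-softmax of the scaled dot-product affinity between tokens.
    logits = x @ x.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)

    # ASSUMPTION: salience of token j is the mean of column j,
    # i.e. how strongly the other tokens attend to it.
    salience = affinity.mean(axis=0)

    # ToMe-style bipartite split: even-indexed tokens (set A) may be
    # merged into odd-indexed tokens (set B).
    a_idx = np.arange(0, n, 2)
    b_idx = np.arange(1, n, 2)
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn[a_idx] @ xn[b_idx].T  # cosine similarity, A x B

    # ASSUMPTION: damp the score of salient A tokens so they are less
    # likely to be merged away.
    protect = salience[a_idx] / salience.max()
    score = sim * (1.0 - protect[:, None])
    best_b = score.argmax(axis=1)
    best_score = score.max(axis=1)

    # ASSUMPTION: adaptive intensity = number of A tokens whose best
    # score clears the threshold, capped at r_max.
    r = int(min(r_max, np.sum(best_score > threshold)))
    order = np.argsort(-best_score)
    merged_a, kept_a = order[:r], order[r:]

    # Salience-weighted average of each merged pair.
    num = x * salience[:, None]
    den = salience.copy()
    for i in merged_a:
        j = b_idx[best_b[i]]
        num[j] += num[a_idx[i]]
        den[j] += den[a_idx[i]]

    keep = np.sort(np.concatenate([a_idx[kept_a], b_idx]))
    return num[keep] / den[keep, None], r

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
reduced, merges = merge_tokens(tokens, r_max=3)
print(reduced.shape, merges)
```

Because the merge count depends on how many pairs clear the threshold, a layer with little redundancy merges few or no tokens, while a layer with many near-duplicates merges up to `r_max`. A real implementation would operate on batched tensors inside each transformer block and would also track token sizes for attention, which this sketch omits.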

On ImageNet-1k with ViT-B/16, AdaMerge consistently beats ToMe, PiToMe, and DSM across all FLOPs-matched operating points. At 13.4G FLOPs (high compression), AdaMerge's Top-1 accuracy drops by only 1.06%, compared with 1.45% for PiToMe and 4.62% for DSM. The gap widens as compression increases, which suggests that AdaMerge handles token redundancy better than the baselines. Because it is training-free, it can be added to a ViT inference pipeline without fine-tuning, which makes it a candidate for real-time applications such as autonomous driving, robotics, and mobile vision. The paper is submitted to NeurIPS 2026.
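To put the quoted figures in relative terms, the short calculation below compares AdaMerge's Top-1 drop with each baseline's at 13.4G FLOPs. It uses only the numbers given in the article; the helper name `relative_reduction` is hypothetical, and the article does not say which baseline or operating point its headline percentage refers to.

```python
# Top-1 accuracy changes at 13.4G FLOPs, as quoted in the article (in %).
drops = {"AdaMerge": -1.06, "PiToMe": -1.45, "DSM": -4.62}

def relative_reduction(ours, baseline):
    """Fraction by which `ours` shrinks the accuracy loss of `baseline`."""
    return 1.0 - abs(ours) / abs(baseline)

for name in ("PiToMe", "DSM"):
    saved = relative_reduction(drops["AdaMerge"], drops[name])
    print(f"vs {name}: {saved:.1%} less Top-1 loss")
# prints:
# vs PiToMe: 26.9% less Top-1 loss
# vs DSM: 77.1% less Top-1 loss
```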

Key Points

Combines salience-weighted similarity and adaptive per-layer reduction in a single training-free framework
At 13.4G FLOPs on ImageNet-1k, Top-1 degradation is only -1.06%, vs. -1.45% for PiToMe and -4.62% for DSM
Consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes

Why It Matters

Enables faster ViT deployment without retraining, critical for edge devices and real-time vision applications.
