Study Finds Vision-Language Models Ignore 'Thinking' Tokens—Accuracy Gains Are Misleading
New research shows that VLMs keep most of their accuracy gains even when 'thought tokens' are replaced with zeros or random noise.
A new paper from researchers Tianyi Zhang, Mahtab Bigverdi, and Ranjay Krishna challenges the growing trend of adding continuous or latent non-textual tokens to vision-language models (VLMs) for 'visual thinking.' While these tokens often improve task accuracy, the team argues that such gains alone do not prove the tokens are used for reasoning. Confounds like added context length, special-token anchoring, or training regularization could be responsible. To address this, they formalize the Ablate-to-Validate diagnostic principle and instantiate it as the Token Replacement Test (TRT). TRT holds prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives—isolating whether performance depends on token content or mere presence.
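The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `replace_latent_tokens`, the mode strings, the use of unit Gaussian noise for the random condition, and the reading of "first-repeat" as copying the first token into every position are all hypothetical choices made for this example. The oracle tokens are assumed to be supplied by the caller.

```python
import random
from typing import List, Optional

Vector = List[float]


def replace_latent_tokens(
    tokens: List[Vector],
    mode: str,
    oracle: Optional[List[Vector]] = None,
    seed: int = 0,
) -> List[Vector]:
    """Return a sequence with the same length and width as `tokens`,
    but with its content swapped out according to `mode`.

    Keeping the shape fixed keeps the token budget fixed, so only the
    content of the latent tokens changes between conditions.
    """
    if not tokens:
        return []
    dim = len(tokens[0])

    if mode == "original":
        # Control condition: unchanged copy of the model's own tokens.
        return [list(t) for t in tokens]
    if mode == "zero":
        return [[0.0] * dim for _ in tokens]
    if mode == "random":
        # Assumption: unit Gaussian noise; the source only says "random".
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in tokens]
    if mode == "first_repeat":
        # Assumption: every position receives a copy of the first token.
        return [list(tokens[0]) for _ in tokens]
    if mode == "oracle":
        if (
            oracle is None
            or len(oracle) != len(tokens)
            or any(len(v) != dim for v in oracle)
        ):
            raise ValueError("oracle tokens must match the original token shape")
        return [list(v) for v in oracle]
    raise ValueError(f"unknown mode: {mode}")


# Toy latent tokens: three positions, two dimensions each.
thoughts = [[0.5, -1.0], [2.0, 0.25], [1.5, 3.0]]
zeroed = replace_latent_tokens(thoughts, "zero")
repeated = replace_latent_tokens(thoughts, "first_repeat")
noisy = replace_latent_tokens(thoughts, "random", seed=1)
```

In a real evaluation, each variant would be fed to the same model with the same prompt, image, and decoding settings, and only the resulting accuracies would be compared.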
The authors applied TRT to LLaVA-13B and Qwen2.5-VL-3B on relative depth reasoning, and to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on the BLINK, VSP, and CV-Bench benchmarks. The results are stark: accuracy gains are a misleading proxy for latent-token reasoning. Across all settings, the VLMs retained most of their improvement even when token content was corrupted or replaced. This reveals a persistent gap between having a latent channel and using it as an information bottleneck. The authors recommend reporting TRT as a standard diagnostic alongside accuracy for any method that introduces continuous thought tokens.
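One way to read "retained most of the improvement" is as the share of the accuracy gain that survives a replacement. The sketch below is an illustrative calculation, not a metric defined in the paper. The function name `gain_retained` is hypothetical, and the accuracies are placeholder values rather than reported results.

```python
def gain_retained(baseline_acc: float, full_acc: float, ablated_acc: float) -> float:
    """Fraction of the accuracy gain that survives token replacement.

    baseline_acc: accuracy without latent tokens
    full_acc:     accuracy with the model's own latent tokens
    ablated_acc:  accuracy with replaced (e.g. zero or random) tokens
    """
    gain = full_acc - baseline_acc
    if gain <= 0:
        raise ValueError("no accuracy gain to attribute")
    return (ablated_acc - baseline_acc) / gain


# Placeholder accuracies for illustration only, not figures from the paper.
retained = gain_retained(baseline_acc=0.5, full_acc=0.75, ablated_acc=0.6875)
```

A value near 1 would suggest the gain comes from the presence of the tokens rather than their content, while a value near 0 would suggest the content matters.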
- The Token Replacement Test (TRT) isolates whether performance depends on token content or just token presence by replacing thoughts with zero, random, or other alternatives.
- Tested on LLaVA-13B and Qwen2.5-VL-3B with multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets, plus three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT).
- Across all tests, most of the accuracy gain persisted even with corrupted tokens, indicating that the models do not depend on the content of the thought tokens.
Why It Matters
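The findings challenge the assumption that 'thinking tokens' improve reasoning. An accuracy gain alone does not show that a model uses the tokens, so developers should validate token utility with ablations such as TRT before crediting them.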