Study Finds Vision-Language Models Ignore 'Thinking' Tokens—Accuracy Gains Are Misleading

arXiv cs.CV May 22, 2026

⚡ New research shows VLMs keep most of their accuracy gains even when 'thought tokens' are replaced with random noise.

Deep Dive

A new paper from researchers Tianyi Zhang, Mahtab Bigverdi, and Ranjay Krishna challenges the growing trend of adding continuous or latent non-textual tokens to vision-language models (VLMs) for 'visual thinking.' While these tokens often improve task accuracy, the team argues that such gains alone do not prove the tokens are used for reasoning. Confounds like added context length, special-token anchoring, or training regularization could be responsible. To address this, they formalize the Ablate-to-Validate diagnostic principle and instantiate it as the Token Replacement Test (TRT). TRT holds prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives—isolating whether performance depends on token content or mere presence.
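The replacement step described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `replace_latent_tokens`, the array layout (one row per latent token), and the exact definitions of each mode are assumptions made for illustration. In particular, "random" is assumed here to mean Gaussian noise matched to the per-dimension mean and standard deviation of the original tokens, and "first-repeat" is assumed to mean copying the first token into every position. The oracle variant is omitted because the summary does not say how it is built.

```python
import numpy as np


def replace_latent_tokens(tokens, mode, rng=None):
    """Return a content-replaced copy of `tokens` with the same shape.

    `tokens` has shape (num_tokens, hidden_dim). The token count (the
    "budget") never changes, so only the content of the tokens differs
    between modes. All mode definitions are illustrative assumptions.
    """
    tokens = np.asarray(tokens, dtype=np.float32)
    if mode == "original":
        return tokens.copy()
    if mode == "zero":
        # Keep the positions, erase the content.
        return np.zeros_like(tokens)
    if mode == "random":
        # Assumption: noise matched to per-dimension statistics.
        if rng is None:
            rng = np.random.default_rng(0)
        mean = tokens.mean(axis=0, keepdims=True)
        std = tokens.std(axis=0, keepdims=True)
        noise = rng.standard_normal(tokens.shape).astype(np.float32)
        return mean + std * noise
    if mode == "first_repeat":
        # Assumption: every position receives a copy of the first token.
        return np.repeat(tokens[:1], tokens.shape[0], axis=0)
    raise ValueError(f"unknown replacement mode: {mode!r}")


# Hypothetical usage: 8 latent tokens of width 16.
latent = np.random.default_rng(1).standard_normal((8, 16))
variants = {m: replace_latent_tokens(latent, m)
            for m in ("original", "zero", "random", "first_repeat")}
```

In a full evaluation, each variant would be fed to the same model with the same prompt, image, and decoding settings, so that any accuracy difference is attributable to token content alone.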

The authors applied TRT to LLaVA-13B and Qwen2.5-VL-3B on relative depth reasoning, and to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on the BLINK, VSP, and CV-Bench benchmarks. The results are stark: accuracy gains are a misleading proxy for latent-token reasoning. Across all settings, VLMs retained most of their improvement even when token content was corrupted or replaced. This reveals a persistent gap between having a latent channel and using it as an information bottleneck. The authors recommend TRT as a standard diagnostic, reported alongside accuracy, for any method that introduces continuous thought tokens.
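One way to quantify "retained most of the improvement" is the fraction of the accuracy gain that survives replacement. The sketch below is an assumption for illustration: the summary does not state which metric the paper reports, and the name `retained_gain_fraction` is hypothetical. The numbers in the usage example are made up and are not results from the study.

```python
def retained_gain_fraction(acc_baseline, acc_with_tokens, acc_replaced):
    """Fraction of the accuracy gain that survives token replacement.

    acc_baseline:    accuracy without any latent tokens
    acc_with_tokens: accuracy with the original latent tokens
    acc_replaced:    accuracy with zero/random/first-repeat tokens

    A value near 1.0 suggests the gain comes from token presence;
    a value near 0.0 suggests the gain depends on token content.
    """
    gain = acc_with_tokens - acc_baseline
    if gain <= 0:
        raise ValueError("latent tokens produced no gain to attribute")
    return (acc_replaced - acc_baseline) / gain


# Made-up numbers for illustration only, not taken from the paper.
fraction = retained_gain_fraction(acc_baseline=0.60,
                                  acc_with_tokens=0.70,
                                  acc_replaced=0.68)
print(f"retained gain: {fraction:.0%}")
```

A high retained fraction under zero or random replacement is the pattern the article describes: the model benefits from the extra tokens being present without reading what they contain.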

Key Points

The Token Replacement Test (TRT) isolates whether performance depends on token content or merely token presence by replacing thought tokens with zero, random, first-repeat, or oracle alternatives while holding prompt, image, token budget, and decoding fixed.
Tested on LLaVA-13B and Qwen2.5-VL-3B with multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets, plus three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT).
Across all settings, models retained most of their accuracy gains even when token content was corrupted or replaced, which suggests the gains do not depend on the reasoning content of the thought tokens.

Why It Matters

The study challenges the assumption that accuracy gains from 'thinking tokens' reflect better reasoning; developers should validate token utility with ablations such as TRT rather than relying on accuracy alone.
