Viral Wire

NTT's Rationale-Enhanced Decoding Makes LVLMs Explainable Without Extra Training

NTT tackles a key part of the AI black-box problem by forcing LVLMs to ground their answers in their own reasoning.

Deep Dive

NTT has introduced Rationale-Enhanced Decoding, an inference framework designed to make multimodal foundation models, specifically Large Vision-Language Models (LVLMs), more explainable. The core problem it addresses: under current chain-of-thought (CoT) reasoning, LVLMs often produce final outputs that ignore the model's own intermediate rationales. In NTT's experiments, many LVLMs generated the same answer they gave with the correct rationale even after that rationale was replaced with irrelevant text. This undermines trust in AI systems for high-stakes applications.
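The rationale-swap check described above can be illustrated with a minimal sketch. It is not NTT's evaluation code: the `answer_fn` interface, the two toy models, and the example data are all hypothetical stand-ins, and a real test would call an actual LVLM. The function measures how often the answer stays the same after the rationale is replaced with irrelevant text; a rate near 1.0 suggests the model is ignoring its rationale.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical example format: (image_id, question, rationale)
Example = Tuple[str, str, str]


def unchanged_answer_rate(
    answer_fn: Callable[[str, str, str], str],
    examples: Sequence[Example],
    irrelevant_rationale: str,
) -> float:
    """Fraction of examples whose answer is unchanged after the rationale swap."""
    if not examples:
        raise ValueError("examples must be non-empty")
    unchanged = 0
    for image_id, question, rationale in examples:
        original = answer_fn(image_id, question, rationale)
        swapped = answer_fn(image_id, question, irrelevant_rationale)
        if original.strip().lower() == swapped.strip().lower():
            unchanged += 1
    return unchanged / len(examples)


# Toy stand-ins for an LVLM (assumptions for illustration only).
def rationale_blind_model(image_id: str, question: str, rationale: str) -> str:
    # Answers from the image alone and never looks at the rationale.
    return "answer-for-" + image_id


def rationale_reading_model(image_id: str, question: str, rationale: str) -> str:
    # Takes the last word of the rationale as its answer.
    return rationale.split()[-1]


examples = [
    ("img1", "What animal is shown?", "It has whiskers, so it is a cat"),
    ("img2", "What vehicle is shown?", "It has two wheels, so it is a bicycle"),
]
IRRELEVANT = "The weather was mild on Tuesday"

blind_rate = unchanged_answer_rate(rationale_blind_model, examples, IRRELEVANT)
reading_rate = unchanged_answer_rate(rationale_reading_model, examples, IRRELEVANT)
print(blind_rate, reading_rate)
```

The rationale-blind toy model keeps every answer after the swap, while the rationale-reading one changes all of them, which is the contrast the experiment is meant to expose.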

The proposed framework decouples image-based inference from rationale-based inference, processing each stream independently and then merging them through ensemble decoding. This structural change forces the model to base its final answer on both the visual input and its explanatory reasoning, without requiring any retraining or fine-tuning. NTT will present this work at CVPR 2026 in Denver. The technology promises more reliable business decision-making and trustworthy collaboration between multiple AI agents, addressing a critical barrier to enterprise adoption of multimodal AI.
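As a rough illustration of the two-stream idea, the sketch below fuses two next-token distributions at a single decoding step. The article does not spell out the merge rule, so the weighted sum of log-probabilities used here (a product-of-experts style ensemble) is an assumption, as are the `rationale_weight` parameter, the function names, and the toy logits. In a real system the two logit sets would come from two forward passes of the same LVLM, one conditioned on the image and one on the rationale.

```python
import math
from typing import Dict


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into log-probabilities (numerically stable)."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def ensemble_next_token(
    image_logits: Dict[str, float],
    rationale_logits: Dict[str, float],
    rationale_weight: float = 1.0,  # hypothetical knob, not from the source
) -> Dict[str, float]:
    """Fuse image-conditioned and rationale-conditioned scores for one step.

    Assumed rule: log p_image + rationale_weight * log p_rationale, renormalized.
    """
    if image_logits.keys() != rationale_logits.keys():
        raise ValueError("both streams must score the same vocabulary")
    image_lp = log_softmax(image_logits)
    rationale_lp = log_softmax(rationale_logits)
    combined = {
        tok: image_lp[tok] + rationale_weight * rationale_lp[tok]
        for tok in image_lp
    }
    return log_softmax(combined)


def greedy_pick(log_probs: Dict[str, float]) -> str:
    """Return the highest-scoring token."""
    return max(log_probs, key=log_probs.get)


# Toy scores over a three-token vocabulary (made up for illustration).
image_logits = {"cat": 2.0, "dog": 1.9, "car": 0.0}
rationale_logits = {"cat": 0.0, "dog": 3.0, "car": 0.0}

image_only = greedy_pick(log_softmax(image_logits))
fused = ensemble_next_token(image_logits, rationale_logits)
fused_choice = greedy_pick(fused)
print(image_only, fused_choice)
```

In this toy case the image stream alone narrowly prefers one token, and the fused distribution prefers the token that the rationale stream strongly supports. Because the fusion happens only at decoding time, no model weights are changed, which matches the article's point that no retraining or fine-tuning is needed.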

Key Points
  • NTT found that existing LVLMs often ignore their own chain-of-thought rationales, producing the same answers even when the rationales are swapped for irrelevant text.
  • Rationale-Enhanced Decoding separately processes visual and rationale inputs, then fuses them via ensemble decoding to ensure grounded outputs.
  • The framework requires no additional model training and will be presented at CVPR 2026, marking a step toward transparent AI for business and multi-agent systems.

Why It Matters

This technique could help build trust in multimodal AI for regulated industries by making inference auditable and grounded in the model's stated rationale.