Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
A new method uses a 'Spherical Gaussian Latent Policy' to let AI models reason visually in continuous space.
A team of researchers has introduced 'Decompose, Look, and Reason' (DLR), a novel framework designed to address a core weakness in current Vision-Language Models (VLMs). VLMs such as GPT-4V or Claude 3 often struggle with complex, multi-step visual reasoning because standard Chain-of-Thought (CoT) prompting forces them to reason purely in text, and verbalizing an image discards crucial visual detail. Existing fixes either rely on expensive external tool calls or inject patch-based embeddings that fail to capture the full semantic context needed for reasoning. DLR addresses this directly by keeping the reasoning process in a continuous visual latent space.
The DLR framework operates in three stages: it first decomposes a user's query into textual premises, then extracts premise-conditioned continuous visual features from the image, and finally deduces an answer through grounded rationales. The key innovation is the 'Spherical Gaussian Latent Policy,' a reinforcement learning technique that lets the model explore and reason within this high-dimensional visual latent space rather than being constrained to discrete text tokens. According to the paper, extensive experiments on vision-centric benchmarks show DLR consistently outperforming strong baselines, including text-only CoT, interleaved multimodal CoT, and prior latent reasoning methods. Beyond raw performance, the framework offers superior stepwise interpretability: users can inspect the visual premises the model considered at each stage of its reasoning.
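The article does not spell out the policy's exact parameterization, but a 'spherical Gaussian' in this setting conventionally means an isotropic Gaussian N(μ, σ²I) over latent vectors, which admits reparameterized sampling and an analytic log-density for policy-gradient training. The minimal sketch below illustrates that reading; the class name, layer shapes, and update rule are assumptions, not the authors' implementation.

```python
# A minimal sketch of a spherical (isotropic) Gaussian policy over a
# continuous visual latent space. Names and shapes are assumptions;
# the paper's exact parameterization is not given in this summary.
import torch
import torch.nn as nn


class SphericalGaussianLatentPolicy(nn.Module):
    """Policy pi(z | h) = N(mu(h), sigma(h)^2 * I) over latent visual tokens."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu_head = nn.Linear(hidden_dim, latent_dim)   # mean of the latent action
        self.log_sigma_head = nn.Linear(hidden_dim, 1)     # one shared scale -> spherical covariance

    def forward(self, h: torch.Tensor):
        mu = self.mu_head(h)                               # (batch, latent_dim)
        sigma = self.log_sigma_head(h).exp()               # (batch, 1), shared across dimensions
        dist = torch.distributions.Normal(mu, sigma.expand_as(mu))
        z = dist.rsample()                                 # reparameterized sample: exploration in latent space
        log_prob = dist.log_prob(z).sum(-1)                # joint log-density for policy-gradient updates
        return z, log_prob


# Hypothetical REINFORCE-style update, where the reward could be answer correctness:
#   z, logp = policy(hidden_state)
#   loss = -(reward - baseline) * logp    # maximize expected reward
```

A single learned scale per sample is what makes the covariance spherical; swapping in a full diagonal covariance would be a different (non-spherical) policy class.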
- Proposes the 'Decompose, Look, and Reason' (DLR) framework to prevent visual information loss in VLM reasoning (see the pipeline sketch after this list).
- Introduces a novel 'Spherical Gaussian Latent Policy' for reinforcement learning in continuous visual latent space.
- Outperforms text-only CoT, interleaved multimodal CoT, and prior latent reasoning methods on vision-centric benchmarks while offering better stepwise interpretability.
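To make the three-stage flow concrete, here is a hypothetical, high-level sketch of how decompose, look, and reason could compose at inference time. Every function name (decompose_query, encode, reason, aggregate) is an illustrative placeholder, not the paper's API.

```python
# A hypothetical, high-level sketch of the three DLR stages described above.
def dlr_answer(vlm, policy, image, query):
    # Stage 1 -- Decompose: break the query into textual premises.
    premises = vlm.decompose_query(query)   # e.g. ["locate the red mug", "count mugs left of it"]

    rationales = []
    for premise in premises:
        # Stage 2 -- Look: extract premise-conditioned continuous visual features,
        # sampled via the spherical Gaussian policy rather than emitted as text.
        h = vlm.encode(image, premise)
        z, _ = policy(h)                    # continuous latent "visual thought"

        # Stage 3 -- Reason: deduce a grounded rationale from the latent feature.
        rationales.append(vlm.reason(premise, z))

    # Final answer, grounded in the stepwise rationales that give the
    # framework its interpretability.
    return vlm.aggregate(rationales)
```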
Why It Matters
This research could lead to more reliable and transparent AI for complex tasks like medical image analysis, scientific discovery, and autonomous robotics.