Research & Papers

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Researchers close the perception-reasoning gap with a dynamic VLM-LLM question-answering loop.

Deep Dive

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major hurdle. The researchers identify a perception-reasoning-decision gap: standalone vision-language models (VLMs) often overlook task-critical visual information. To close this gap, they introduce PRISM, a framework that tightly couples perception (a VLM) and decision-making (an LLM) through a closed-loop dynamic question-answer (DQA) pipeline.

In PRISM, the LLM actively critiques the VLM's initial description, probes it with goal-oriented questions, and synthesizes a compact image description tailored to the task at hand. This yields a sharp, task-driven understanding of the scene. Evaluated on the ALFWorld and Room-to-Room (R2R) benchmarks, PRISM delivers consistent and substantial gains over existing state-of-the-art image-based models. Crucially, the entire pipeline is automatic, eliminating the need for handcrafted questions or answers. This paves the way for more reliable embodied AI agents that can operate effectively in real-world, multimodal environments.
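
For concreteness, here is a minimal sketch of what such a closed-loop DQA pipeline could look like in code. The article does not specify this interface; the `dqa_loop`, `vlm`, and `llm` names, the prompts, and the stopping rule below are illustrative assumptions, not PRISM's actual implementation.

```python
# Hypothetical sketch of a closed-loop dynamic question-answer (DQA) pipeline,
# assuming the VLM and LLM are exposed as simple text-in/text-out callables.
# All function names, prompts, and loop details are illustrative.

from typing import Any, Callable


def dqa_loop(
    image: Any,                        # raw observation from the environment
    task_goal: str,                    # e.g. "put a clean mug in the coffee machine"
    vlm: Callable[[Any, str], str],    # vlm(image, prompt) -> text
    llm: Callable[[str], str],         # llm(prompt) -> text
    max_rounds: int = 3,
) -> str:
    """Return a compact, task-focused scene description for the decision-making LLM."""
    # 1. Perception: the VLM produces an initial, task-agnostic description.
    description = vlm(image, "Describe this scene in detail.")

    for _ in range(max_rounds):
        # 2. Critique/probe: the LLM checks the description against the goal
        #    and asks about whatever task-critical detail is still missing.
        question = llm(
            f"Task: {task_goal}\nScene description: {description}\n"
            "If a task-critical visual detail is missing, ask one short "
            "question about it. Otherwise reply DONE."
        )
        if question.strip().upper() == "DONE":
            break

        # 3. Answer: the question goes back to the VLM, so the answer is
        #    grounded in the actual image rather than the LLM's guess.
        answer = vlm(image, question)

        # 4. Synthesize: fold the new information into a compact description.
        description = llm(
            f"Task: {task_goal}\nCurrent description: {description}\n"
            f"Q: {question}\nA: {answer}\n"
            "Rewrite the description so it is compact and keeps only "
            "task-relevant details."
        )

    return description
```

In a full agent, the returned description would feed the LLM's action-selection prompt for environments like ALFWorld or R2R; the fixed round budget and the "DONE" stopping signal are assumptions made for this sketch.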

Key Points
  • PRISM uses a dynamic QA loop where the LLM critiques, probes, and synthesizes VLM descriptions for better task focus.
  • Outperforms state-of-the-art image-based models on ALFWorld and Room-to-Room (R2R) benchmarks.
  • Fully automatic—no need for manually crafted questions or answers, enabling real-time deployment.

Why It Matters

PRISM closes the perception-reasoning gap, enabling more reliable AI agents for robotics and autonomous navigation.