Research & Papers

New benchmark reveals VLMs struggle with embodied 3D interaction tasks

Current VLMs handle high-level spatial reasoning but remain fragile on low-level interaction tasks such as grasp point and trajectory prediction.

Deep Dive

A team led by Jiyao Zhang has released Embodied3DBench, a robot-centric benchmark designed to systematically evaluate the low-level spatial intelligence of Vision Language Models (VLMs) in embodied 3D environments. The benchmark comprises 6 task categories split into two core groups: Spatial Structural Understanding (grounding, spatial relation prediction, and multi-view correspondence) and Interaction-Oriented Perception (affordance prediction, grasp point prediction, and trajectory prediction). It spans 12 subcategories and over 21,000 high-quality question-answer pairs. The researchers tested 13 state-of-the-art models and found that while current VLMs possess relatively strong high-level spatial reasoning (e.g., object-to-object relations), they remain fragile in interaction-oriented perception, which points to a significant lack of robust 3D-aware interaction priors.
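
The two-group task structure described above can be sketched as a small lookup table plus a per-group score. This is an illustrative sketch only: the identifiers, the (task, is_correct) record format, and the use of plain accuracy are assumptions made here, not the benchmark's actual schema or metrics (grasp point and trajectory prediction in particular may be scored differently).

```python
from collections import defaultdict

# Task taxonomy as described in the summary; identifiers are illustrative.
TASK_GROUPS = {
    "spatial_structural_understanding": [
        "grounding",
        "spatial_relation_prediction",
        "multi_view_correspondence",
    ],
    "interaction_oriented_perception": [
        "affordance_prediction",
        "grasp_point_prediction",
        "trajectory_prediction",
    ],
}

# Reverse lookup: task name -> group name.
TASK_TO_GROUP = {
    task: group for group, tasks in TASK_GROUPS.items() for task in tasks
}


def group_accuracy(results):
    """Average correctness per task group.

    `results` is an iterable of (task_name, is_correct) pairs. This record
    format is hypothetical, not the benchmark's real output format.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for task, is_correct in results:
        group = TASK_TO_GROUP[task]
        totals[group][0] += int(is_correct)
        totals[group][1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}
```

Grouping scores this way makes the reported pattern easy to inspect: a model can score well on the structural group while lagging on the interaction group.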

To close this capability gap, the researchers synthesized a large-scale training dataset comprising 1.3 million QA pairs focused on low-level spatial tasks. Fine-tuning on this dataset led to notable improvements across the tested models, demonstrating that targeted data can enhance interaction-aware spatial reasoning. Embodied3DBench thus provides both a systematic evaluation framework and a scalable data solution. This sets a clear target for developing multimodal systems that can understand and act in complex 3D environments, a key step toward more capable robotics and embodied AI agents.

Key Points
  • 6 task categories split into Spatial Structural Understanding (3 tasks) and Interaction-Oriented Perception (3 tasks), with 12 subcategories and over 21,000 QA pairs.
  • Evaluated 13 state-of-the-art VLMs: strong on object-to-object spatial relations but consistently poor on grasp point and trajectory prediction.
  • Synthesized a training dataset of 1.3M QA pairs; fine-tuning on it yields notable gains in low-level interaction perception.

Why It Matters

Identifies a critical gap in VLM spatial reasoning for robotics and offers a scalable data approach to narrow it.