R³L improves 3D layout generation with reliable spatial reasoning
New framework fixes errors in multi-hop 3D spatial reasoning, achieving physically consistent layouts.
Relative spatial relations are a compact way to describe spatial structure, but current Multimodal Large Language Models (MLLMs) often produce unreliable inferences when such relations are chained across multiple steps. Each step requires a reference-frame transformation, and the errors from these repeated transformations accumulate into semantic and metric drift. To address this, Zhifeng Gu, Yuqi Wang, and Bing Wang, in the paper 'R³L: Reasoning 3D Layouts from Relative Spatial Relations', propose a general framework that breaks coupled relation chains via invariant spatial decomposition. This method ensures that spatial relationships remain consistent regardless of the reference frame used.
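The drift problem described above can be illustrated with a toy 2D example. This is only a sketch of the failure mode, not R³L's method: the helper names (`compose`, `chain`, `drift`), the one-metre spacing, and the fixed per-hop heading error are all assumptions chosen for illustration. Each object is placed "1 m in front of" the previous one, and a small heading error is introduced at every hop.

```python
import math

def compose(pose, rel):
    """Map a relative pose (dx, dy, dtheta), given in `pose`'s frame, to the world frame."""
    x, y, th = pose
    dx, dy, dth = rel
    return (
        x + dx * math.cos(th) - dy * math.sin(th),
        y + dx * math.sin(th) + dy * math.cos(th),
        th + dth,
    )

def chain(n_hops, heading_error_deg=0.0):
    """Place n_hops objects, each 1 m in front of the previous one.

    heading_error_deg is a hypothetical per-hop error in the estimated heading.
    """
    pose = (0.0, 0.0, 0.0)
    err = math.radians(heading_error_deg)
    for _ in range(n_hops):
        pose = compose(pose, (1.0, 0.0, err))
    return pose

def drift(n_hops, heading_error_deg):
    """Distance between the error-free end position and the one with per-hop error."""
    ex, ey, _ = chain(n_hops, 0.0)
    ax, ay, _ = chain(n_hops, heading_error_deg)
    return math.hypot(ax - ex, ay - ey)

if __name__ == "__main__":
    for hops in (1, 2, 4, 8):
        print(hops, round(drift(hops, 5.0), 3))
```

In this toy setup the first hop is unaffected, because the heading error only shows up once a later object is placed relative to the mis-oriented frame. After that, the position error grows with every additional hop, which is the kind of compounding that motivates breaking up long relation chains.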
R³L also employs a consistent spatial imagination loop that iteratively imagines and revises layouts to promote self-consistency. A third component, supportive spatial optimization, uses global-to-local coordinate re-parameterization to ease pose optimization. Extensive experiments on various scene types and instructions show that R³L produces layouts that are both physically feasible and semantically aligned with user intent. The code is open-sourced on GitHub, and the paper has been accepted at ICML 2026.
- Invariant spatial decomposition breaks coupled relation chains to prevent error accumulation in multi-hop reasoning.
- Consistent spatial imagination uses an imagine-and-revise loop to self-correct layout inconsistencies.
- Supportive spatial optimization re-parameterizes poses from global to local coordinates, improving optimization stability.
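To make the idea of global-to-local re-parameterization more concrete, here is a minimal 2D sketch. It assumes that "local" means expressing an object's pose in the frame of a supporting parent object; the summary does not spell out the paper's actual formulation, so the functions `to_local` and `to_global` and the table/lamp scene are hypothetical.

```python
import math

def to_local(parent, child):
    """Express the child's global pose (x, y, theta) in the parent's frame."""
    px, py, pth = parent
    cx, cy, cth = child
    dx, dy = cx - px, cy - py
    c, s = math.cos(pth), math.sin(pth)
    # Rotate the world-frame offset by -pth to get parent-frame coordinates.
    return (c * dx + s * dy, -s * dx + c * dy, cth - pth)

def to_global(parent, local):
    """Recover the child's global pose from the parent's pose and a local pose."""
    px, py, pth = parent
    lx, ly, lth = local
    c, s = math.cos(pth), math.sin(pth)
    return (px + c * lx - s * ly, py + s * lx + c * ly, pth + lth)

if __name__ == "__main__":
    table = (2.0, 1.0, math.pi / 2)   # hypothetical parent object
    lamp = (2.0, 1.5, math.pi / 2)    # hypothetical child, 0.5 m in front of the table
    lamp_local = to_local(table, lamp)

    # Move and rotate the table; the lamp follows because only its local
    # pose is stored, so the "in front of" relation is preserved.
    moved_table = (4.0, 3.0, 0.0)
    print(to_global(moved_table, lamp_local))
```

Under this parameterization, an optimizer that adjusts the parent's pose does not have to separately re-adjust every dependent object to keep their relations intact, which is one plausible reason a local parameterization could make pose optimization more stable.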
Why It Matters
R³L enables reliable 3D scene generation from natural language instructions, a capability that is crucial for robotics, VR, and design tools.