Thinking with Spatial Code for Physical-World Video Reasoning
New method converts RGB video into explicit 3D representations, beating proprietary models on VSI-Bench.
A team of researchers, including Jieneng Chen and Alan Yuille, has introduced a novel AI framework called 'Thinking with Spatial Code' that fundamentally changes how AI understands video. The core innovation is a spatial encoder that transforms standard RGB video frames into explicit, structured 3D representations. This process generates temporally coherent data featuring 3D oriented bounding boxes and semantic labels for objects, effectively creating a 'spatial code' that describes the scene in geometric terms.
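To make the idea concrete, here is a minimal sketch of what one record of such a spatial code might look like. The paper describes temporally coherent 3D oriented bounding boxes with semantic labels; the field names, units, and box parameterization below are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class OrientedBox3D:
    center: tuple[float, float, float]  # (x, y, z) in metres, world frame (assumed)
    size: tuple[float, float, float]    # (width, height, depth)
    yaw: float                          # rotation about the vertical axis, radians

@dataclass
class SpatialCodeEntry:
    object_id: int                      # stable track ID across frames
    label: str                          # semantic class, e.g. "chair"
    boxes: dict[int, OrientedBox3D] = field(default_factory=dict)  # frame index -> box

# One tracked chair observed in two frames, drifting along x.
chair = SpatialCodeEntry(object_id=7, label="chair")
chair.boxes[0] = OrientedBox3D((1.0, 0.0, 2.0), (0.5, 0.9, 0.5), 0.0)
chair.boxes[5] = OrientedBox3D((1.2, 0.0, 2.0), (0.5, 0.9, 0.5), 0.0)

# Because the geometry is explicit, an LLM can be handed text such as
# "object 7 (chair) moved +0.2 m along x between frames 0 and 5"
# instead of raw pixels.
dx = chair.boxes[5].center[0] - chair.boxes[0].center[0]
print(round(dx, 2))  # 0.2
```

The point of the structure is that spatial questions ("which object is closer?", "how far did it move?") reduce to arithmetic over these explicit variables rather than implicit visual features.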
This spatial code acts as a clean interface for large language models (LLMs). Instead of forcing LLMs to interpret raw pixels, the framework lets them reason directly over explicit spatial variables, such as object positions, sizes, and trajectories. To build the encoder, the researchers unified 6D object parsing and tracking backbones with geometric prediction; they then fine-tuned LLMs with reinforcement learning guided by a 'spatial rubric reward,' a reward function that encourages perspective-aware, geometrically grounded inferences.
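The article does not spell out how the spatial rubric reward is computed, but a rubric-style reward typically scores an answer against a checklist of criteria. The sketch below is a hypothetical two-item rubric, checking a claimed distance against the encoder's geometry within a tolerance, and checking a perspective-dependent relation ("left of the camera"); the items, weights, and tolerance are assumptions for illustration only.

```python
def spatial_rubric_reward(predicted_dist: float, true_dist: float,
                          claimed_side: str, true_side: str,
                          tol: float = 0.25) -> float:
    """Score a model's spatial answer against parsed scene geometry.

    A toy two-item rubric (weights and tolerance are assumed, not the paper's):
    each satisfied criterion contributes 0.5, so the reward lies in [0, 1].
    """
    score = 0.0
    # Rubric item 1: metric accuracy of a claimed distance,
    # within a relative tolerance of the ground-truth distance.
    if abs(predicted_dist - true_dist) <= tol * true_dist:
        score += 0.5
    # Rubric item 2: perspective-aware relation (e.g. "left" vs "right"
    # of the camera) matches the geometry from the spatial encoder.
    if claimed_side == true_side:
        score += 0.5
    return score

# Model claims the sofa is 3.1 m away and left of the camera;
# the spatial code says 3.0 m and left -> both rubric items satisfied.
print(spatial_rubric_reward(3.1, 3.0, "left", "left"))  # 1.0
```

During RL fine-tuning, a scalar reward of this shape would be all the optimizer needs: answers that are geometrically grounded and perspective-consistent score higher, steering the LLM toward the inferences the rubric rewards.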
The result is a model that excels at physical-world visual question answering (VQA). On the challenging VSI-Bench, which tests spatial and temporal reasoning in video, this approach has set a new state-of-the-art, outperforming existing proprietary vision-language models. The work demonstrates that providing LLMs with explicit, structured 3D scene understanding, rather than implicit visual features, leads to more accurate and reliable reasoning about dynamic real-world environments. The code for the project is publicly available, promoting further research in this direction.
- Framework parses video into explicit 3D representations with bounding boxes and labels, creating a 'spatial code'.
- Enables LLMs to reason over spatial variables directly, outperforming proprietary models on VSI-Bench.
- Uses reinforcement learning with a novel spatial rubric reward to train for geometrically grounded inference.
Why It Matters
Enables more reliable AI for robotics and autonomous systems by providing explicit 3D understanding of dynamic environments.