Spatio-Temporal Grounding of Large Language Models from Perception Streams
A new method uses formal logic to generate unlimited training data, boosting a tiny model's F1 score by 39 percentage points.
A team of researchers led by Jacob Anderson has developed a novel framework called Formally Explainable Spatio-Temporal Scenes (FESTS) to solve a core problem in embodied AI. Current large language models often fail at precise spatial reasoning—understanding how objects move and interact in 3D space over time. FESTS addresses this by compiling natural language queries (like 'find when the cup moves left of the plate') into a formal logic language called Spatial Regular Expressions (SpRE). This logic can then be automatically matched against structured video logs, generating perfectly aligned training data—(query, video frames, match, explanation) tuples—without any manual labeling. This pipeline creates an unlimited, self-supervised data source for teaching models complex spatio-temporal concepts.
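The pipeline described above can be illustrated with a toy sketch. Everything below is an assumption for illustration: the function names, the per-frame log format, and the single `left_of` predicate are hypothetical stand-ins, not the authors' SpRE implementation. The point is the shape of the output: a (query, video frames, match, explanation) tuple produced with no manual labeling.

```python
# Hypothetical sketch of the FESTS-style data-generation idea.
# The log format and all names here are illustrative assumptions,
# not the paper's actual SpRE compiler or matcher.

def left_of(a, b):
    """Toy spatial predicate: object a is left of object b on the x-axis."""
    return a["x"] < b["x"]

def generate_tuple(query, predicate, obj_a, obj_b, frame_log):
    """Match a predicate against a structured frame log and emit a
    (query, frames, match, explanation) training tuple."""
    matched = [t for t, frame in enumerate(frame_log)
               if predicate(frame[obj_a], frame[obj_b])]
    explanation = (f"'{query}' holds at frames {matched}: "
                   f"{obj_a}.x < {obj_b}.x in each matched frame.")
    return (query, frame_log, matched, explanation)

# Toy log: the cup moves from right of the plate to its left at frame 2.
log = [
    {"cup": {"x": 5.0}, "plate": {"x": 3.0}},
    {"cup": {"x": 4.0}, "plate": {"x": 3.0}},
    {"cup": {"x": 2.0}, "plate": {"x": 3.0}},
]
query, frames, match, expl = generate_tuple(
    "find when the cup moves left of the plate", left_of, "cup", "plate", log)
# match == [2]
```

Because the match is computed by the logic itself, the supervision signal is verifiable by construction, which is what makes the data source effectively unlimited.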
Using this method, the team trained a relatively small 3-billion-parameter model on just 27,000 of these generated data tuples. The results were dramatic: the model's F1 score on frame-level spatio-temporal reasoning tasks jumped from 48.5% to 87.5%. Crucially, this performance matches that of OpenAI's massive GPT-4.1 model on complex reasoning benchmarks, despite the trained model being roughly two orders of magnitude (about 100x) smaller. The breakthrough lies in the quality and precision of the synthetically generated training data, which provides verifiable, logic-grounded supervision that pure text training lacks.
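For readers unfamiliar with the metric, frame-level F1 balances precision (how many predicted frames were correct) against recall (how many true frames were found). The paper's exact evaluation protocol isn't reproduced here; this is the standard F1 definition applied to sets of frame indices, with a made-up example.

```python
def frame_f1(predicted, gold):
    """Standard F1 over frame indices: harmonic mean of precision
    and recall on the sets of predicted vs. ground-truth frames."""
    p_set, g_set = set(predicted), set(gold)
    tp = len(p_set & g_set)  # true positives: frames flagged correctly
    if tp == 0:
        return 0.0
    precision = tp / len(p_set)
    recall = tp / len(g_set)
    return 2 * precision * recall / (precision + recall)

# Illustrative only: model flags frames {2, 3, 4}; ground truth is {3, 4, 5}.
# precision = 2/3, recall = 2/3, so F1 = 2/3.
score = frame_f1([2, 3, 4], [3, 4, 5])
```

An F1 of 87.5% therefore means the model's predicted frame sets overlap the ground truth almost completely, in both directions.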
This work fundamentally shifts how we can equip AI with spatial and temporal understanding. By moving from massive, compute-heavy models to smaller, more efficiently trained models with superior data, it opens the door for real-time spatio-temporal intelligence in resource-constrained environments. The immediate application is for Video LLMs and embodied AI agents in robotics, where understanding 'what happened when and where' is critical for navigation, manipulation, and interaction with the physical world.
- The FESTS framework uses Spatial Regular Expressions (SpRE) to turn language queries into formal logic, generating unlimited, label-free training data from video.
- A 3-billion-parameter model trained on 27k data tuples saw its F1 score jump from 48.5% to 87.5%, matching GPT-4.1's performance.
- The model achieves state-of-the-art spatio-temporal reasoning while being 100x smaller than frontier models like GPT-4.1, enabling efficient deployment.
Why It Matters
Enables complex, real-time spatial reasoning for robots and video AI using small, efficient models instead of massive, expensive ones.