Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
New method uses physics simulators to teach AI models like GPT-4V how fluids and objects move.
A research team from Shanghai Jiao Tong University and other institutions has identified a major weakness in today's top multimodal AI models like GPT-4V and Claude 3.5: they lack intuitive physics understanding. While excelling at static image recognition, these models struggle to predict how objects move, interact, and change over time, especially with complex materials like fluids. The researchers introduced two new benchmark tasks—Next Frame Selection (NFS) and Temporal Coherence Verification (TCV)—to isolate this capability, and found that even state-of-the-art models perform poorly, highlighting a fundamental gap between visual recognition and true physical reasoning.
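To make the Next Frame Selection task concrete, here is a minimal sketch of how such a multiple-choice item could be constructed and scored. The `NFSItem` record, the distractor-sampling strategy, and the scoring helper are all illustrative assumptions, not the benchmark's actual code:

```python
# Hypothetical sketch of a Next Frame Selection (NFS) item: the model sees a
# short context clip and must pick which candidate frame plausibly comes next.
# The data structure and distractor strategy here are assumptions.
import random
from dataclasses import dataclass

@dataclass
class NFSItem:
    context_frames: list   # ordered frames leading up to time t (e.g. file paths)
    choices: list          # shuffled candidate frames for time t+1
    answer_index: int      # index of the true next frame within `choices`

def make_nfs_item(clip, num_distractors=3, rng=random):
    """Build one NFS item from an ordered clip: the last frame is the target;
    distractors are earlier frames sampled from the same clip."""
    *context, true_next = clip
    pool = context[:-1]    # exclude the frame right before the target
    distractors = rng.sample(pool, k=min(num_distractors, len(pool)))
    choices = distractors + [true_next]
    rng.shuffle(choices)
    return NFSItem(context, choices, choices.index(true_next))

def nfs_accuracy(items, predict):
    """Score a choice function: predict(item) -> index into item.choices."""
    correct = sum(predict(it) == it.answer_index for it in items)
    return correct / len(items)
```

Under this setup, a model that guesses uniformly at random would land near 1/(1 + num_distractors) accuracy, which is the floor against which the paper's poor MLLM results can be read.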
To bridge this gap, the team developed Scene Dynamic Field (SDF), a cost-efficient fine-tuning framework that leverages data from physics simulators. By training models on simulated physical interactions, SDF teaches them the underlying principles of motion and dynamics. The results are significant, with improvements of up to 20.7% on fluid-based reasoning tasks and strong generalization to unseen physical scenarios. This work provides a practical pathway to creating more physically grounded and reliable AI assistants for applications in robotics, simulation, and real-world video analysis.
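The idea of mining supervision from a physics simulator can be illustrated with a toy stand-in: roll out a simple simulation and convert each rollout into a "what happens next?" training record. The bouncing-ball dynamics and the record schema below are illustrative assumptions; the paper's simulators and data format differ:

```python
# Toy stand-in for simulator-derived training data: Euler-integrate a ball
# falling under gravity (with a lossy floor bounce) and turn each rollout
# into a supervised next-state prediction example.

G = 9.81    # gravitational acceleration, m/s^2
DT = 0.05   # simulation timestep, s

def rollout(y0, vy0, steps):
    """Simulate a bouncing ball; returns a list of heights over time."""
    y, vy, ys = y0, vy0, []
    for _ in range(steps):
        vy -= G * DT
        y += vy * DT
        if y < 0:                  # bounce off the floor, losing some energy
            y, vy = -y, -0.8 * vy
        ys.append(round(y, 3))
    return ys

def make_example(y0, vy0, context_len=8):
    """One training record: context states plus the ground-truth next state.
    The dict schema is hypothetical, chosen only for illustration."""
    ys = rollout(y0, vy0, context_len + 1)
    return {"context": ys[:context_len],
            "question": "Given these heights, what is the next height?",
            "answer": ys[context_len]}
```

Because the labels come for free from the simulator's own state, data like this can be generated at scale and cheaply, which is what makes a fine-tuning framework in this style cost-efficient.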
- Identified critical gap: Top MLLMs like GPT-4V fail at basic physics reasoning tasks like predicting object motion.
- Introduced Scene Dynamic Field (SDF): A fine-tuning method using physics simulator data to teach models intuitive dynamics.
- Achieved up to 20.7% performance gains: SDF significantly improved model accuracy on fluid reasoning and generalized to new physical domains.
Why It Matters
Enables more reliable AI for robotics, autonomous systems, and video analysis by grounding models in real-world physics.