Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Model for Robotic Manipulation
A new framework repurposes pretrained video generators to teach robots physics, beating top models without action-specific training.
A research team led by Zijian Song has published a paper titled 'Learning Physics from Pretrained Video Models,' introducing the PhysGen framework. This work tackles a core challenge in robotics—the scarcity of large-scale training data—by repurposing existing foundation models. Instead of training robots from scratch, PhysGen uses pretrained autoregressive video generation models as a proxy for a physics simulator. The key innovation is a 'multimodal continuous representation' that bridges the gap between discrete video frames and continuous robotic control by unifying video and action data into shared physical tokens. This lets the system transfer implicit physical knowledge learned from vast video datasets, such as object dynamics and object permanence, directly to manipulation tasks.
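The paper's exact tokenization scheme isn't spelled out here, but the core idea of a shared physical-token sequence can be sketched. Below is a minimal, hypothetical PyTorch illustration: encoded video-frame embeddings and continuous robot actions (projected by a small MLP into the same embedding space) are interleaved into one sequence that a single autoregressive transformer can model. All names and dimensions (`ActionProjector`, `embed_dim`, the 7-DoF action size) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn as nn

class ActionProjector(nn.Module):
    """Hypothetical module: maps a continuous action vector (e.g., a
    7-DoF end-effector command) into the transformer's embedding space."""
    def __init__(self, action_dim: int = 7, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, T, action_dim) -> (batch, T, embed_dim)
        return self.mlp(actions)

def interleave_physical_tokens(frame_emb: torch.Tensor,
                               action_emb: torch.Tensor) -> torch.Tensor:
    """Interleave per-step frame and action embeddings into a single
    'physical token' sequence: [f_1, a_1, f_2, a_2, ...]."""
    b, t, d = frame_emb.shape
    tokens = torch.stack([frame_emb, action_emb], dim=2)  # (b, t, 2, d)
    return tokens.reshape(b, 2 * t, d)

# Example: 4 timesteps, 512-dim embeddings, 7-DoF actions.
frames = torch.randn(1, 4, 512)        # stand-in for encoded video frames
proj = ActionProjector(action_dim=7, embed_dim=512)
actions = proj(torch.randn(1, 4, 7))   # continuous actions -> shared space
seq = interleave_physical_tokens(frames, actions)
print(seq.shape)  # torch.Size([1, 8, 512])
```

Interleaving is one plausible design; the point is simply that once actions live in the same token space as frames, the video model's next-token machinery applies to control unchanged.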
Technically, PhysGen incorporates several techniques for efficient training and inference, including causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value caching. On the LIBERO and ManiSkill benchmarks, it surpassed the strong baselines OpenVLA and WorldVLA by 13.8% and 8.8%, respectively. Most notably, in real-world tests, PhysGen matched the performance of large-scale action-pretrained models such as π₀ without requiring any prior action-specific pretraining, excelling in physically complex tasks like grasping transparent objects. This demonstrates a scalable, data-efficient path toward more generalizable robotic manipulation by extracting 'physical intuition' from the world's video data.
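This summary doesn't give the paper's implementations, but two of the listed techniques are standard in autoregressive transformers and easy to illustrate. The sketch below shows a causal attention mask (each token attends only to earlier positions) and a lookahead-style multi-token head that predicts several future tokens from one hidden state, in the spirit of L-MTP; the class names and the two-step lookahead horizon are assumptions, not the paper's design. KV caching, noted in the comments, would reuse the keys/values of already-generated tokens at inference rather than recomputing them.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len: int) -> torch.Tensor:
    """Standard causal mask: position i may attend only to positions <= i.
    True entries are disallowed attention links."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

class MultiTokenHead(nn.Module):
    """Illustrative lookahead head in the spirit of L-MTP: from each hidden
    state, separate linear heads predict the next `horizon` tokens, so
    training supervises several future steps per position."""
    def __init__(self, embed_dim: int, vocab_size: int, horizon: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, vocab_size) for _ in range(horizon)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, T, embed_dim) -> one logit tensor per lookahead step
        return [head(hidden) for head in self.heads]

# Causal self-attention over the physical-token sequence. At inference,
# KV caching would store the K/V of past tokens so each new token costs
# one attention query instead of reprocessing the whole prefix.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 8, 512)              # e.g., the interleaved token sequence
out, _ = attn(x, x, x, attn_mask=causal_mask(8))
logits = MultiTokenHead(512, vocab_size=1024)(out)
print(out.shape, logits[0].shape)       # (1, 8, 512) (1, 8, 1024)
```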
- PhysGen uses pretrained autoregressive video generators as physics simulators, bypassing the need for massive robotic datasets.
- Outperformed leading models OpenVLA and WorldVLA by 13.8% and 8.8% on the LIBERO and ManiSkill benchmarks, without action pretraining.
- Matched the real-world performance of the large action-pretrained model π₀, excelling at physically challenging tasks such as grasping transparent objects.
Why It Matters
PhysGen provides a scalable, data-efficient blueprint for teaching robots complex physical reasoning by leveraging existing video foundation models instead of costly robot-specific data collection, accelerating development.