Robotics

Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining

A new AI framework decouples vision and action training, unlocking web video data for robots.

Deep Dive

A research team has introduced DeFI (Disentangled Forward and Inverse dynamics pretraining), a framework designed to overcome a core limitation in current robot AI. Most state-of-the-art robot policies are Vision-Language-Action (VLA) models, which entangle two distinct tasks: predicting future video frames (forward dynamics) and deciding on actions (inverse dynamics). This entanglement creates a misalignment between 2D image forecasting and 3D action prediction and, crucially, prevents the models from learning from the vast amounts of action-free video available on the web.
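
To make the two roles concrete, here is a minimal sketch contrasting the two mappings: forward dynamics predicts the next visual state from the current one (and so needs no action labels), while inverse dynamics recovers the action that explains a transition between two states. The module shapes, dimensions, and names below are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM = 512, 7  # assumed: visual embedding size, 7-DoF arm

    class ForwardDynamics(nn.Module):
        """Forward dynamics: predict the next visual state from the current one.
        Needs no action labels, so it can train on raw web video."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM, 1024), nn.GELU(),
                nn.Linear(1024, STATE_DIM))

        def forward(self, s_t):
            return self.net(s_t)  # predicted s_{t+1}

    class InverseDynamics(nn.Module):
        """Inverse dynamics: infer the action that explains s_t -> s_{t+1}.
        Supervised training requires action-labelled robot data."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * STATE_DIM, 1024), nn.GELU(),
                nn.Linear(1024, ACTION_DIM))

        def forward(self, s_t, s_next):
            return self.net(torch.cat([s_t, s_next], dim=-1))

A monolithic VLA head forces both mappings through shared weights and a shared data diet; keeping them separate is what lets the forward model consume action-free web video.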

DeFI solves this by splitting the problem into two specialized, independently pretrainable components. First, a General Forward Dynamics Model (GFDM) is trained on massive datasets of diverse human and robot videos purely to predict future visual states. Second, a General Inverse Dynamics Model (GIDM) uses self-supervised learning to infer the latent actions that caused the transitions between video frames. The two expert models are then integrated and fine-tuned together for specific robotic tasks. This 'pretrain separately, cooperate later' approach lets each model learn from the data source best suited to it before the two are combined.
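
A minimal sketch of that staged recipe follows, assuming frame pairs arrive already encoded as embeddings and that the GIDM's self-supervision works by bottlenecking each transition through a latent action that must suffice to reconstruct the next state. The function names, loss terms, and the small action_head that maps latent to real actions are all assumptions, not the paper's released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentActionGIDM(nn.Module):
        """Self-supervised inverse dynamics: encode a transition (s_t, s_t+1)
        into a latent action z, then require z to reconstruct s_t+1 from s_t."""
        def __init__(self, state_dim=512, latent_dim=16):
            super().__init__()
            self.encode = nn.Linear(2 * state_dim, latent_dim)
            self.decode = nn.Linear(state_dim + latent_dim, state_dim)

        def forward(self, s_t, s_next):
            z = self.encode(torch.cat([s_t, s_next], dim=-1))
            s_hat = self.decode(torch.cat([s_t, z], dim=-1))
            return z, s_hat

    def pretrain_gfdm(gfdm, video_batches, opt):
        """Stage 1: video prediction on action-free human and robot video."""
        for s_t, s_next in video_batches:
            loss = F.mse_loss(gfdm(s_t), s_next)
            opt.zero_grad(); loss.backward(); opt.step()

    def pretrain_gidm(gidm, video_batches, opt):
        """Stage 2: self-supervised latent actions from frame transitions."""
        for s_t, s_next in video_batches:
            _, s_hat = gidm(s_t, s_next)
            loss = F.mse_loss(s_hat, s_next)
            opt.zero_grad(); loss.backward(); opt.step()

    def finetune_together(gfdm, gidm, action_head, robot_batches, opt):
        """Stage 3: cooperate. On action-labelled robot data, ground latent
        actions in real ones while keeping the video-prediction signal.
        (opt must cover the parameters of all three modules.)"""
        for s_t, s_next, a_t in robot_batches:
            z, _ = gidm(s_t, s_next)
            loss = F.mse_loss(action_head(z), a_t) + F.mse_loss(gfdm(s_t), s_next)
            opt.zero_grad(); loss.backward(); opt.step()

The key design point is that stages 1 and 2 never see a robot action label; only stage 3 touches action-labelled robot data, which is comparatively scarce.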

The results are state-of-the-art. On the challenging CALVIN long-horizon benchmark, DeFI achieved an average task length of 4.51 out of a maximum of 5. It scored a 51.2% success rate on the SimplerEnv-Fractal benchmark and, most impressively, an 81.3% success rate in real-world robotic deployments, a significant improvement over previous entangled training methods. The work, accepted at ICLR 2026, demonstrates a more data-efficient and performant path toward generalist robots capable of complex, long-horizon tasks by fundamentally rethinking how they learn from visual data.

Key Points
  • Decouples forward (video prediction) and inverse (action) dynamics into separate models, GFDM and GIDM, for more efficient pretraining.
  • Leverages massive, action-free web video for the GFDM, a resource that entangled VLA training cannot exploit.
  • Achieved 81.3% real-world deployment success and a 4.51 average task length on CALVIN, outperforming prior VLA models.

Why It Matters

Unlocks web-scale video for robot training, enabling more capable and data-efficient generalist robots for real-world tasks.