Robotics

Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining

A new AI framework decouples vision and action training, unlocking web video data for robots.

Deep Dive

A research team has introduced DeFI (Disentangled Forward and Inverse dynamics pretraining), a framework designed to overcome a core limitation in current robot AI. Most state-of-the-art robot policies are Vision-Language-Action (VLA) models, which entangle two distinct tasks: predicting future video frames (forward dynamics) and deciding on actions (inverse dynamics). This entanglement creates a misalignment between 2D image forecasting and 3D action prediction and, crucially, prevents the models from learning from the vast amounts of action-free video available on the web.
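
To make the two roles concrete, here is a minimal sketch contrasting the two mappings: forward dynamics predicts the next visual state from the current one (and so needs no action labels), while inverse dynamics recovers the action that explains a transition between two states. The module shapes, dimensions, and names below are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM = 512, 7  # assumed: visual embedding size, 7-DoF arm

    class ForwardDynamics(nn.Module):
        """Forward dynamics: predict the next visual state from the current one.
        Needs no action labels, so it can train on raw web video."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM, 1024), nn.GELU(),
                nn.Linear(1024, STATE_DIM))

        def forward(self, s_t):
            return self.net(s_t)  # predicted s_{t+1}

    class InverseDynamics(nn.Module):
        """Inverse dynamics: infer the action that explains s_t -> s_{t+1}.
        Supervised training requires action-labelled robot data."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * STATE_DIM, 1024), nn.GELU(),
                nn.Linear(1024, ACTION_DIM))

        def forward(self, s_t, s_next):
            return self.net(torch.cat([s_t, s_next], dim=-1))

A monolithic VLA head forces both mappings through shared weights and a shared data diet; keeping them separate is what lets the forward model consume action-free web video.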

DeFI solves this by splitting the problem into two specialized, independently pretrainable components. First, a General Forward Dynamics Model (GFDM) is trained on massive datasets of diverse human and robot videos purely to predict future visual states. Second, a General Inverse Dynamics Model (GIDM) uses self-supervised learning to infer the latent actions that caused the transitions between video frames. The two expert models are then integrated and fine-tuned together for specific robotic tasks. This 'pretrain separately, cooperate later' approach lets each model learn from the data source best suited to it before the two are combined.
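
A minimal sketch of that staged recipe follows, assuming frame pairs arrive already encoded as embeddings and that the GIDM's self-supervision works by bottlenecking each transition through a latent action that must suffice to reconstruct the next state. The function names, loss terms, and the small action_head that maps latent to real actions are all assumptions, not the paper's released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentActionGIDM(nn.Module):
        """Self-supervised inverse dynamics: encode a transition (s_t, s_t+1)
        into a latent action z, then require z to reconstruct s_t+1 from s_t."""
        def __init__(self, state_dim=512, latent_dim=16):
            super().__init__()
            self.encode = nn.Linear(2 * state_dim, latent_dim)
            self.decode = nn.Linear(state_dim + latent_dim, state_dim)

        def forward(self, s_t, s_next):
            z = self.encode(torch.cat([s_t, s_next], dim=-1))
            s_hat = self.decode(torch.cat([s_t, z], dim=-1))
            return z, s_hat

    def pretrain_gfdm(gfdm, video_batches, opt):
        """Stage 1: video prediction on action-free human and robot video."""
        for s_t, s_next in video_batches:
            loss = F.mse_loss(gfdm(s_t), s_next)
            opt.zero_grad(); loss.backward(); opt.step()

    def pretrain_gidm(gidm, video_batches, opt):
        """Stage 2: self-supervised latent actions from frame transitions."""
        for s_t, s_next in video_batches:
            _, s_hat = gidm(s_t, s_next)
            loss = F.mse_loss(s_hat, s_next)
            opt.zero_grad(); loss.backward(); opt.step()

    def finetune_together(gfdm, gidm, action_head, robot_batches, opt):
        """Stage 3: cooperate. On action-labelled robot data, ground latent
        actions in real ones while keeping the video-prediction signal.
        (opt must cover the parameters of all three modules.)"""
        for s_t, s_next, a_t in robot_batches:
            z, _ = gidm(s_t, s_next)
            loss = F.mse_loss(action_head(z), a_t) + F.mse_loss(gfdm(s_t), s_next)
            opt.zero_grad(); loss.backward(); opt.step()

The key design point is that stages 1 and 2 never see a robot action label; only stage 3 touches action-labelled robot data, which is comparatively scarce.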

The results are state-of-the-art. On the challenging CALVIN long-horizon benchmark, DeFI achieved an average task length of 4.51 out of a maximum of 5. It scored a 51.2% success rate on the SimplerEnv-Fractal benchmark and, most impressively, an 81.3% success rate in real-world robotic deployments, a significant improvement over previous entangled training methods. The work, accepted at ICLR 2026, demonstrates a more data-efficient and performant path toward generalist robots capable of complex, long-horizon tasks by fundamentally rethinking how they learn from visual data.

Key Points
  • Decouples forward (video prediction) and inverse (action) dynamics into separate models, GFDM and GIDM, for more efficient pretraining.
  • Leverages massive, action-free web video for the GFDM, a resource that entangled VLA training cannot exploit.
  • Achieved 81.3% real-world deployment success and a 4.51 average task length on CALVIN, outperforming prior VLA models.

Why It Matters

Unlocks web-scale video for robot training, enabling more capable and data-efficient generalist robots for real-world tasks.