From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A new taxonomy maps three ways AI can translate passive video into actionable robot commands.
A team of researchers has published a survey on arXiv analyzing the frontier of teaching robots to manipulate objects by learning from video data alone. The paper, 'From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data,' tackles a core tension: video is a scalable source of physical dynamics, but it lacks the action labels and embodiment-specific data needed for direct robot control. The authors propose an interface-centric taxonomy to organize the field, moving beyond model architectures to focus on where and how the translation from visual observation to physical action occurs.
The survey identifies three primary families of methods: direct video-to-action policies, which keep the control interface implicit within a neural network; latent-action methods, which route temporal video structure through a compact, learned intermediate representation; and explicit visual interfaces, which predict interpretable targets, such as object poses or trajectories, for a downstream controller. For each family, the analysis examines control-integration properties: how the control loop is closed, what can be verified before execution, and where failures are most likely to enter the system.
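To make the taxonomy concrete, here is a minimal Python sketch of the three families. All class and attribute names (VideoPolicy, DirectPolicy, and so on) are illustrative assumptions, not APIs or models from the survey; the point is only where the video-to-action translation happens in each design.

```python
from abc import ABC, abstractmethod
import numpy as np

# Illustrative sketch only: these names are not from the survey. They show
# *where* the video-to-action translation occurs in each method family.

class VideoPolicy(ABC):
    @abstractmethod
    def act(self, frames: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        """Map recent frames (T, H, W, 3) and joint state to a motor command."""

# Family 1: direct video-to-action. One end-to-end network; the control
# interface stays implicit inside its weights.
class DirectPolicy(VideoPolicy):
    def __init__(self, network):
        self.network = network

    def act(self, frames, proprio):
        return self.network(frames, proprio)

# Family 2: latent-action. Temporal video structure is routed through a
# compact learned latent z, then decoded into embodiment-specific actions.
class LatentActionPolicy(VideoPolicy):
    def __init__(self, video_encoder, action_decoder):
        self.video_encoder = video_encoder
        self.action_decoder = action_decoder

    def act(self, frames, proprio):
        z = self.video_encoder(frames)          # embodiment-agnostic latent action
        return self.action_decoder(z, proprio)  # grounded for this robot

# Family 3: explicit visual interface. The model predicts an interpretable
# target (e.g., an object pose or trajectory) that can be inspected and
# checked *before* a downstream controller executes it.
class ExplicitInterfacePolicy(VideoPolicy):
    def __init__(self, perception, controller, is_feasible):
        self.perception = perception
        self.controller = controller
        self.is_feasible = is_feasible

    def act(self, frames, proprio):
        target = self.perception(frames)        # e.g., a 6-DoF goal pose
        if not self.is_feasible(target):        # pre-execution verification
            raise ValueError("predicted target failed the feasibility check")
        return self.controller(target, proprio)
```

Note where each family exposes a checkpoint: only the explicit interface produces an artifact that can be validated before the robot moves, which is precisely the control-integration property the survey's analysis turns on.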
A key synthesis of the survey is that the most pressing open challenges lie not in video understanding itself, but in the 'robotics integration layer': the mechanisms that connect video-derived predictions to safe, dependable, and repeatable robot behavior in the real world. The paper concludes by outlining specific research directions to close this gap, emphasizing methods that ensure robustness and reliability when bridging the sim-to-real and video-to-action divides.
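As one hypothetical example of what such an integration layer might contain, the sketch below wraps any video-derived command in a simple pre-execution gate. The SafetyGate class, its joint limits, and its rate limit are assumptions made for illustration, not mechanisms described in the paper.

```python
import numpy as np

# Hypothetical integration-layer component (not from the survey): a gate
# between a video-derived policy and the robot that enforces simple safety
# envelopes before any command is executed.
class SafetyGate:
    def __init__(self, joint_limits: np.ndarray, max_step: float):
        self.joint_limits = joint_limits  # shape (n_joints, 2): [low, high] in rad
        self.max_step = max_step          # max joint displacement per control tick

    def filter(self, current: np.ndarray, command: np.ndarray) -> np.ndarray:
        # Clamp the predicted command into the robot's joint envelope.
        command = np.clip(command, self.joint_limits[:, 0], self.joint_limits[:, 1])
        # Rate-limit the motion so one bad prediction cannot cause a jump.
        delta = np.clip(command - current, -self.max_step, self.max_step)
        return current + delta

# Usage: a 7-DoF arm with symmetric limits; an aggressive command is
# clamped and rate-limited instead of being executed verbatim.
gate = SafetyGate(joint_limits=np.array([[-2.9, 2.9]] * 7), max_step=0.05)
safe_cmd = gate.filter(current=np.zeros(7), command=np.full(7, 3.5))
```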
- Introduces a new 'interface-centric taxonomy' for video-to-control methods, moving beyond model architecture to focus on the translation point.
- Identifies three core method families: direct policies, latent-action models, and explicit visual interfaces, analyzing each for control-loop integration.
- Pinpoints the 'robotics integration layer' as the major unsolved challenge, outlining research needed to connect video predictions to reliable robot actions.
Why It Matters
This roadmap is critical for developing robots that can learn complex manipulation skills autonomously from the vast, unlabeled video data on the internet.