SeeTraceAct: Robots learn new tasks from a single human demo video
One-shot robot learning from human video, no teleoperation needed.
Training robots for new tasks typically requires costly task-specific teleoperation data. In a new paper, researchers present SeeTraceAct, a demo-conditioned vision-language-action model that learns an unseen task from a single demonstration video, even when the demo comes from a different embodiment (e.g., a human). The key innovation is visibility-aware prediction of future end-effector traces, which forces the model to precisely localize small target regions rather than rely on coarse visual cues. Because only one video per new task is needed, the approach could substantially reduce data collection costs.
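The summary does not spell out how visibility-aware trace prediction is trained, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes the model outputs, for each future timestep, a 2D end-effector position in normalized image coordinates plus a visibility logit, and that position error is counted only where the end-effector is actually visible. The function name `visibility_aware_trace_loss` and all of its parameters are hypothetical.

```python
import math


def visibility_aware_trace_loss(pred_xy, pred_vis_logit, true_xy, true_vis):
    """Hypothetical loss for a predicted end-effector trace.

    pred_xy / true_xy: lists of (x, y) points, one per future timestep.
    pred_vis_logit:    list of raw visibility scores (logits).
    true_vis:          list of 0/1 flags, 1 if the point is visible.
    """
    if not true_vis:
        raise ValueError("trace must contain at least one point")

    pos_sum, n_visible, bce_sum = 0.0, 0, 0.0
    for (px, py), logit, (tx, ty), vis in zip(pred_xy, pred_vis_logit, true_xy, true_vis):
        # Binary cross-entropy on the visibility flag, in its numerically
        # stable form: max(z, 0) - z * y + log(1 + exp(-|z|)).
        bce_sum += max(logit, 0.0) - logit * vis + math.log1p(math.exp(-abs(logit)))
        if vis:
            # Position error is only counted where the end-effector is visible.
            pos_sum += (px - tx) ** 2 + (py - ty) ** 2
            n_visible += 1

    pos_loss = pos_sum / n_visible if n_visible else 0.0
    vis_loss = bce_sum / len(true_vis)
    return pos_loss + vis_loss


# Second point is occluded, so its (wrong) predicted position is ignored.
loss = visibility_aware_trace_loss(
    pred_xy=[(0.5, 0.5), (0.9, 0.9)],
    pred_vis_logit=[0.0, 0.0],
    true_xy=[(0.5, 0.5), (0.1, 0.1)],
    true_vis=[1, 0],
)
```

Under this assumed formulation, occluded points contribute nothing to the position term, so the model is not penalized for guessing where a hidden gripper is, while the visibility term still asks it to predict when the end-effector will be in view.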
To enable reproducible evaluation with cross-embodiment demonstrations, the team releases RoboCasa-DC, an extension of RoboCasa with episode-paired humanoid videos. Experiments on both RoboCasa-DC and a real-world setup, in which a Franka Panda arm is conditioned on human demonstrations, show that SeeTraceAct outperforms existing end-to-end baselines: it achieves the best success rate across all four simulated settings and a 12.5 percentage point improvement in real-world average success. The work points toward robots that can learn new skills from casual human videos, without expensive teleoperation or hundreds of examples.
- SeeTraceAct enables one-shot learning of a new task from a single human demonstration video, with no task-specific teleoperation data required.
- Uses visibility-aware prediction of future end-effector traces for precise spatial grounding of small targets.
- Achieves a 12.5 percentage point improvement in average real-world success rate over existing end-to-end baselines on a Franka Panda arm conditioned on human demos.
Why It Matters
One-shot robot learning from human video could sharply cut data collection costs, since a new task needs one demo video rather than a teleoperated dataset, and could speed up task adaptation in real-world settings.