Robotics

CRAFT: Video Diffusion for Bimanual Robot Data Generation

A new framework synthesizes photorealistic bimanual robot training videos from simulation, boosting policy success rates without replaying demonstrations on real hardware.

Deep Dive

A team from the University of Southern California and UC Berkeley has introduced CRAFT (Canny-guided Robot Data Generation using Video Diffusion Transformers), a novel framework that tackles a fundamental bottleneck in robotics: the scarcity and high cost of diverse, real-world training data for bimanual (two-arm) manipulation. The system leverages a pre-trained video diffusion model, conditioning it on structural Canny edge maps extracted from frames rendered along trajectories generated in a physics simulator. This approach allows CRAFT to synthesize temporally coherent, photorealistic videos of robot actions while automatically producing the corresponding action labels, all without needing to replay demonstrations on an actual, expensive robot setup.
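
To make the conditioning step concrete, below is a minimal sketch of how structural edge maps could be extracted from rendered simulator frames and paired with the simulator's action labels. The Canny thresholds, the `diffusion_model.generate` call, and the `make_training_pair` helper are illustrative assumptions, not the authors' actual API.

```python
import numpy as np
import cv2  # OpenCV, used here for Canny edge extraction

def edges_from_sim_frames(frames: list[np.ndarray],
                          low: int = 100, high: int = 200) -> np.ndarray:
    """Convert rendered simulator RGB frames into per-frame Canny edge maps.

    Each edge map keeps the scene's structure (arm links, grippers, object
    contours) while discarding simulator-specific textures -- the kind of
    structural signal used to guide the video diffusion model.
    """
    edge_maps = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        edge_maps.append(cv2.Canny(gray, low, high))
    return np.stack(edge_maps)  # shape (T, H, W), one edge image per frame

def make_training_pair(frames, joint_trajectory, diffusion_model):
    """Hypothetical downstream use: the edge video conditions a pre-trained
    video diffusion model, and the simulator's joint trajectory is kept as
    the action labels paired with the generated photorealistic clip."""
    cond = edges_from_sim_frames(frames)
    rgb_video = diffusion_model.generate(condition=cond)  # placeholder API
    return rgb_video, joint_trajectory
```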

CRAFT functions as a powerful data augmentation pipeline, capable of generating vast variations from a small seed of real demonstrations. It can modify object poses, shift camera viewpoints, alter lighting and backgrounds, and even simulate different robot embodiments. This massively expands the visual diversity of the training dataset, which is crucial for teaching robots to generalize beyond the narrow conditions seen during initial training. In experiments across both simulated and real-world bimanual tasks, policies trained on CRAFT-generated data showed improved success rates compared to those trained with standard data augmentation or simply more real data.
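
A rough sketch of how such an augmentation sweep might be parameterized from a single seed demonstration is shown below. The field names, value ranges, and embodiment identifiers are assumptions made for illustration and do not reflect the authors' actual configuration.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneVariant:
    """One sampled variation of a seed demonstration.

    Fields are illustrative stand-ins for the factors the framework varies:
    object placement, camera viewpoint, lighting, and robot embodiment.
    """
    object_pose_jitter_cm: float   # translation noise applied to object poses
    camera_yaw_deg: float          # rotated camera viewpoint around the workspace
    lighting_preset: str           # e.g. "studio", "dim", "warm"
    robot_embodiment: str          # e.g. "franka_dual", "ur5_dual" (hypothetical names)

def sample_variants(n: int, seed: int = 0) -> list[SceneVariant]:
    """Sample n scene variants used to re-render one seed demo in simulation."""
    rng = random.Random(seed)
    lighting = ["studio", "dim", "warm", "cool"]
    embodiments = ["franka_dual", "ur5_dual"]
    return [
        SceneVariant(
            object_pose_jitter_cm=rng.uniform(0.0, 5.0),
            camera_yaw_deg=rng.uniform(-30.0, 30.0),
            lighting_preset=rng.choice(lighting),
            robot_embodiment=rng.choice(embodiments),
        )
        for _ in range(n)
    ]

# Each variant would be re-rendered in the simulator, converted to edge maps,
# and passed through the diffusion model to yield a new photorealistic
# training clip that shares (or retargets) the original action labels.
variants = sample_variants(100)
```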

The research demonstrates that modern generative AI, specifically video diffusion models, can be effectively repurposed to solve core robotics challenges. By bridging the simulation-to-reality (Sim2Real) gap with photorealistic video generation, CRAFT offers a scalable and cost-effective path to acquiring the large, varied datasets necessary for robust robot learning. This method could significantly accelerate progress in complex manipulation tasks, from industrial assembly to domestic assistance, by reducing dependency on slow and expensive physical data collection.

Key Points
  • Generates photorealistic training videos from simulation data using a conditioned video diffusion model, creating action-consistent demonstrations.
  • Acts as a unified augmentation pipeline, varying object poses, camera views, lighting, and robot embodiments from a few real demos.
  • Improves robot policy success rates on bimanual tasks over standard methods, demonstrating the value of video diffusion for data diversity.

Why It Matters

Drastically reduces the cost and time of collecting robot training data, enabling more robust and generalizable bimanual manipulation policies.