dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Unifies vision, language, and actions into one token space for evaluation.
dWorldEval tackles the scalability problem in robotic policy evaluation by introducing a discrete diffusion world model. Traditional evaluation requires running policies across thousands of environments and tasks, which is computationally prohibitive. The researchers unify all modalities—vision, language, and robotic actions—into a single token space, then use a transformer-based denoising network to model them jointly. The architecture includes a sparse keyframe memory that maintains spatiotemporal consistency and a progress token that tracks task completion. During inference, the model rolls out future observations alongside the progress token, declaring success once the predicted progress reaches 1.
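The paper's code is not shown here, but the inference-time evaluation loop can be sketched at a high level. In the toy version below, `predict_step` is a hypothetical stand-in for the discrete diffusion denoiser (the real model predicts tokens from the full multimodal context), and the fixed progress increment is an illustrative assumption, not the model's behavior:

```python
def predict_step(obs_tokens, action_tokens, progress):
    """Hypothetical stand-in for the world model's denoising step:
    returns next observation tokens and an updated progress value.
    Toy dynamics only -- the real model is a learned transformer."""
    next_obs = [(t + a) % 256 for t, a in zip(obs_tokens, action_tokens)]
    return next_obs, min(1.0, progress + 0.25)  # assumed fixed increment

def evaluate_rollout(policy, init_obs, max_steps=20):
    """Roll the world model forward under a policy; following the
    paper's idea, success is declared when progress reaches 1."""
    obs, progress = init_obs, 0.0
    for step in range(max_steps):
        actions = policy(obs)
        obs, progress = predict_step(obs, actions, progress)
        if progress >= 1.0:
            return True, step + 1  # success and number of steps taken
    return False, max_steps

policy = lambda obs: [1] * len(obs)  # trivial placeholder policy
success, steps = evaluate_rollout(policy, [0, 0, 0])
```

The key design point this illustrates is that success detection is read off a predicted token rather than queried from a simulator, which is what lets evaluation scale without instantiating each environment.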
Extensive experiments show that dWorldEval significantly outperforms prior approaches like WorldEval, Ctrl-World, and WorldGym on benchmarks including LIBERO, RoboTwin, and multiple real-robot tasks. The unified token space and transformer backbone enable the model to generalize across diverse tasks and environments without requiring explicit simulation of each scenario. This work paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale, potentially reducing the cost and time needed to validate robotic policies in real-world applications.
- Maps vision, language, and robotic actions into a unified token space using a transformer-based denoising network.
- Introduces sparse keyframe memory for spatiotemporal consistency and a progress token for automatic task completion detection.
- Outperforms WorldEval, Ctrl-World, and WorldGym on LIBERO, RoboTwin, and real-robot tasks.
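The unified token space and masked discrete diffusion can be sketched in miniature. Everything below is illustrative: the token ids, the `MASK` sentinel, and the constant-fill "denoiser" are assumptions standing in for the paper's learned transformer, which predicts each masked token from the joint vision-language-action context:

```python
import random

MASK = -1  # hypothetical mask-token id

def corrupt(tokens, mask_ratio):
    """Forward process of masked discrete diffusion: randomly mask tokens."""
    return [MASK if random.random() < mask_ratio else t for t in tokens]

def denoise(tokens, fill_value=0):
    """Stand-in for the denoising network: a real model would predict
    each masked token from the full multimodal sequence."""
    return [fill_value if t == MASK else t for t in tokens]

# Unified token space: vision, language, and action tokens in one sequence,
# so a single transformer can model all modalities jointly.
vision = [101, 102, 103]
language = [201, 202]
action = [301]
sequence = vision + language + action

random.seed(0)
noisy = corrupt(sequence, mask_ratio=0.5)
clean = denoise(noisy)
```

Putting all modalities in one discrete sequence is what allows the same denoising objective to cover observation prediction, language grounding, and action modeling at once.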
Why It Matters
Enables scalable, cost-effective robotic policy evaluation without exhaustive real-world testing.