Researchers tackle truncation bias in robotic policy evaluation with liveness-based Bellman operator
Finite rollout lengths need not distort policy evaluation for sparse-reward manipulation tasks
Policy evaluation is a critical bottleneck in robotic manipulation, especially when rewards are sparse and task progression is non-monotonic (a policy may fail partway through and then recover). Standard evaluation methods rely on Bellman equations that assume an infinite horizon, but real-world rollouts are always cut off after a finite number of steps. That mismatch introduces truncation bias, which misrepresents policy performance: for example, a success that would have occurred after the cutoff never shows up in the data. Hao Wang, Joshua Bowden, Colton Crosby, and Somil Bansal, likely affiliated with the University of Southern California, tackle this problem head-on in a paper accepted at RSS 2026. They propose a "discounted liveness formulation" that reframes evaluation as a task-completion problem, using a liveness-based Bellman operator whose fixed point is a conservative value function. This function inherently accounts for finite horizons and sparse rewards, and the authors prove contraction guarantees for the operator.
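To make the truncation problem concrete, here is a minimal toy sketch. It is not taken from the paper and does not implement the authors' liveness-based operator. It assumes a hypothetical policy that completes a sparse-reward task with a fixed probability at each step, and it compares the infinite-horizon discounted value with what a Monte Carlo evaluator measures when every rollout is cut off after a fixed number of steps. All names and numbers (`P_SUCCESS`, `GAMMA`, `HORIZON`, the function names) are illustrative assumptions.

```python
import random


def infinite_horizon_value(p, gamma):
    """Discounted value when the task completes with probability p at each step.

    Success at step t (t = 0, 1, ...) has probability p * (1 - p)**t and
    yields a single reward of 1 discounted by gamma**t, so the value is a
    geometric series: p / (1 - gamma * (1 - p)).
    """
    q = gamma * (1.0 - p)
    return p / (1.0 - q)


def truncated_value(p, gamma, horizon):
    """Expected discounted return when rollouts stop after `horizon` steps."""
    q = gamma * (1.0 - p)
    return p * (1.0 - q ** horizon) / (1.0 - q)


def monte_carlo_truncated(p, gamma, horizon, n_rollouts, seed=0):
    """Monte Carlo estimate from rollouts that are cut off at `horizon` steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        for t in range(horizon):
            if rng.random() < p:      # task completed at step t
                total += gamma ** t   # sparse reward of 1, discounted
                break
        # Rollouts that hit the cutoff contribute 0, even though the
        # policy would eventually have succeeded.
    return total / n_rollouts


# Hypothetical settings chosen only for illustration.
P_SUCCESS, GAMMA, HORIZON = 0.05, 0.99, 20

v_true = infinite_horizon_value(P_SUCCESS, GAMMA)
v_trunc = truncated_value(P_SUCCESS, GAMMA, HORIZON)
v_mc = monte_carlo_truncated(P_SUCCESS, GAMMA, HORIZON, n_rollouts=20000)

print(f"infinite-horizon value : {v_true:.3f}")
print(f"truncated expectation  : {v_trunc:.3f}")
print(f"truncated MC estimate  : {v_mc:.3f}")
print(f"truncation bias        : {v_true - v_trunc:.3f}")
```

In this toy setting the truncated estimate always sits below the infinite-horizon value, and the gap shrinks as `HORIZON` grows. More rollouts do not close the gap, because the bias comes from the cutoff rather than from sampling noise.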
Empirically, the method shines on two simulated manipulation benchmarks (using a vision-language-action, or VLA, model and a diffusion policy) and a real cloth folding task with human demonstrations. It consistently reduces truncation bias and provides a more faithful measure of task progress than classical TD(0) and Monte Carlo baselines. By making offline evaluation reliable even with limited rollout lengths, this work lets practitioners better assess policy quality before expensive real-world deployment, a key step toward safer, more data-efficient robot learning.
- Proposes a liveness-based Bellman operator that yields a conservative value function robust to finite-horizon truncation bias.
- Evaluated on simulated manipulation tasks with both Vision-Language-Action models and diffusion policies, plus a real cloth folding task with human demonstrations.
- Outperforms classical baselines like TD(0) and Monte Carlo policy evaluation in accurately reflecting task progress.
Why It Matters
More accurate offline policy evaluation unlocks safer, data-efficient robot learning for real-world deployment.