Robotics

DIAL framework tops human demonstration score with intent-amplified RL

DIAL counters mode collapse and surpasses the human-driven demonstration score for the first time.

Deep Dive

A new arXiv paper by Lu et al. introduces DIAL (Driving-Intent-Amplified Reinforcement Learning), a framework that tackles mode collapse in continuous-action driving policies trained from single demonstrations. Standard policies cluster their samples around the demonstrated maneuver, so drawing more samples adds little diversity and best-of-N performance plateaus. DIAL operates in two stages. First, it conditions a flow-matching action head on discrete intent labels using classifier-free guidance (CFG), which spreads the sampling distribution across distinct maneuver modes. Second, it uses multi-intent GRPO (Group Relative Policy Optimization), in which each rollout group spans all intent classes during preference RL, so that fine-tuning does not re-collapse the policy around the currently preferred mode.
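To make the first stage concrete, here is a minimal sketch of classifier-free guidance applied to a flow-matching sampler. It is an illustration under stated assumptions, not the paper's implementation: toy_velocity is a hypothetical stand-in for a learned velocity network, the three 2-D intent targets are invented, and the guidance rule is the standard CFG combination v = v_uncond + w * (v_cond - v_uncond).

```python
import numpy as np

# Hypothetical 2-D "action" targets for three made-up intents (not from the paper).
INTENT_TARGETS = np.array([[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])


def toy_velocity(x, t, intent=None):
    """Stand-in for a learned velocity field.

    intent=None plays the role of the unconditional branch and pulls toward
    the average of all intent targets.
    """
    target = INTENT_TARGETS.mean(axis=0) if intent is None else INTENT_TARGETS[intent]
    return (target - x) / (1.0 - t)


def cfg_velocity(x, t, intent, guidance_scale):
    """Standard CFG combination of unconditional and conditional velocities."""
    v_uncond = toy_velocity(x, t, None)
    v_cond = toy_velocity(x, t, intent)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample_action(intent, guidance_scale, num_steps=16, rng=None):
    """Euler-integrate the guided velocity from noise (t=0) toward t=1."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(2)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        x = x + dt * cfg_velocity(x, t, intent, guidance_scale)
    return x


for w in (1.0, 2.0):
    print(w, [sample_action(i, w).round(3).tolist() for i in range(3)])
```

In this toy setup a guidance scale of 1 recovers the plain conditional sample, while larger scales push each intent's sample further from the unconditional average, which is the sense in which guidance separates the maneuver modes.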

Evaluated on the WOD-E2E benchmark, DIAL's intent-CFG sampling achieved a Rater Feedback Score (RFS) of 9.14 at best-of-128, beating both the human-driven demonstration (8.13) and the strongest prior method, RAP (8.5 at best-of-64). Multi-intent GRPO improved held-out RFS from 7.681 to 8.211, while every single-intent baseline peaked lower and degraded by the end of training. These results indicate that the key bottleneck in preference RL for continuous-action policies is not just how to update the policy, but how to expand and preserve the sampling distribution being optimized.
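The group-relative part of GRPO can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it uses the commonly cited GRPO advantage (reward minus group mean, divided by group standard deviation), the helper names are hypothetical, and the rewards are random placeholders, not RFS values. The one DIAL-specific idea shown is that the rollout group is built to contain every intent class.

```python
import numpy as np

NUM_INTENTS = 8  # the article cites 8 rule-derived driving intents


def build_multi_intent_group(samples_per_intent, num_intents=NUM_INTENTS):
    """Return one intent label per rollout so every intent class is in the group."""
    return np.repeat(np.arange(num_intents), samples_per_intent)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Placeholder rewards for a 16-rollout group (2 rollouts per intent); not real data.
intents = build_multi_intent_group(samples_per_intent=2)
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 10.0, size=intents.shape[0])
advantages = group_relative_advantages(rewards)

for intent, reward, adv in zip(intents, rewards, advantages):
    print(f"intent={intent} reward={reward:.2f} advantage={adv:+.2f}")
```

Because advantages are computed relative to the whole group, rollouts from every intent are compared against each other in each update. A group drawn from a single intent would only ever rank samples within that one mode.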

Key Points
  • DIAL uses intent-conditioned flow matching with CFG to expand the sampling distribution across 8 rule-derived driving intents.
  • Achieved RFS 9.14 at best-of-128, surpassing the human demonstration (8.13) and the previous best (RAP, 8.5 at best-of-64).
  • Multi-intent GRPO improved held-out RFS from 7.681 to 8.211, while single-intent baselines degraded.

Why It Matters

By keeping distinct maneuver modes available during RL fine-tuning, this framework could give autonomous driving policies a wider range of candidate behaviors to choose from, which may in turn support safer driving.