Robotics

DIAL framework beats human driving with intent-amplified RL

arXiv cs.RO May 14, 2026

⚡ DIAL overcomes mode collapse and surpasses the human-driven demonstration benchmark for the first time.

Deep Dive

A new arXiv paper by Lu et al. introduces DIAL (Driving-Intent-Amplified Reinforcement Learning), a framework that tackles mode collapse in continuous-action driving policies trained from single demonstrations. Standard policies cluster their samples around the demonstrated maneuver, which limits best-of-N performance. DIAL operates in two stages. First, it conditions a flow-matching action head on discrete intent labels using classifier-free guidance (CFG), expanding the sampling distribution across distinct maneuver modes. Second, it applies multi-intent GRPO (Group Relative Policy Optimization), which samples across all intent classes during preference RL so that fine-tuning does not re-collapse the policy around the currently preferred mode.
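The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_velocity` is a hypothetical stand-in for a learned flow-matching action head, the 2-D action space and guidance scale are illustrative choices, and the grouping of rollouts across all intents is an assumption about how a multi-intent group might be formed. The CFG combination and the group-relative advantage use their standard textbook forms.

```python
import numpy as np

NUM_INTENTS = 8  # the article reports 8 rule-derived driving intents


def toy_velocity(x, t, intent=None):
    """Hypothetical stand-in for a learned velocity field (not the paper's model).

    With an intent, the field pulls toward an intent-specific target on the
    unit circle; without one (the unconditional branch), it pulls toward 0.
    """
    if intent is None:
        target = np.zeros(2)
    else:
        angle = 2.0 * np.pi * intent / NUM_INTENTS
        target = np.array([np.cos(angle), np.sin(angle)])
    return target - x


def cfg_velocity(x, t, intent, guidance_scale):
    """Standard classifier-free guidance: v_u + w * (v_c - v_u)."""
    v_uncond = toy_velocity(x, t, None)
    v_cond = toy_velocity(x, t, intent)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample_action(intent, guidance_scale=2.0, steps=20, rng=None):
    """Integrate the guided velocity field from noise with Euler steps."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(2)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * cfg_velocity(x, k * dt, intent, guidance_scale)
    return x


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Assumed multi-intent group: one rollout per intent, scored together.
group = [sample_action(i) for i in range(NUM_INTENTS)]
rewards = [-np.linalg.norm(a - np.array([1.0, 0.0])) for a in group]  # toy reward
advantages = group_relative_advantages(rewards)
```

Because the group in this sketch contains a rollout from every intent, the advantages compare maneuver modes against each other instead of comparing near-identical samples from a single mode.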

Evaluated on the WOD-E2E benchmark, DIAL's intent-CFG sampling achieved a Rater Feedback Score (RFS) of 9.14 at best-of-128, beating both the human-driven demonstration (8.13) and the strongest prior method, RAP (8.5 at best-of-64). Multi-intent GRPO improved held-out RFS from 7.681 to 8.211, while every single-intent baseline peaked lower and degraded by the end of training. These results indicate that the key bottleneck in preference RL for continuous-action policies is not only how to update the policy, but also how to expand and preserve the sampling distribution being optimized.
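To see why the breadth of the sampling distribution matters for best-of-N scores, consider a toy sketch. Everything here is assumed for illustration: the 1-D action, the `toy_score` function, and the two samplers are hypothetical and unrelated to RFS or the WOD-E2E data. It only shows that when the best action lies away from the demonstrated mode, a collapsed sampler cannot reach it no matter how many samples are drawn.

```python
import numpy as np


def best_of_n(scores, n):
    """Best score among the first n samples."""
    return float(np.max(scores[:n]))


def toy_score(actions):
    """Hypothetical score that peaks at action = 1.0 (not RFS)."""
    return -np.abs(actions - 1.0)


rng = np.random.default_rng(0)

# Collapsed policy: every sample sits near the demonstrated action 0.0.
collapsed = rng.normal(0.0, 0.05, size=128)

# Diverse policy: samples spread over 8 assumed maneuver modes.
modes = rng.choice(np.linspace(-1.0, 1.0, 8), size=128)
diverse = modes + rng.normal(0.0, 0.05, size=128)

for n in (1, 8, 128):
    print(n,
          best_of_n(toy_score(collapsed), n),
          best_of_n(toy_score(diverse), n))
```

In this toy setting, increasing N helps the diverse sampler far more than the collapsed one, which mirrors the article's point that the sampling distribution itself can be the bottleneck.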

Key Points

DIAL uses intent-conditioned flow matching with CFG to expand the sampling distribution across 8 rule-derived driving intents.
Achieved an RFS of 9.14 at best-of-128, surpassing the human demonstration (8.13) and the previous best method (RAP, 8.5).
Multi-intent GRPO improved held-out RFS from 7.681 to 8.211, while single-intent baselines degraded.

Why It Matters

This framework could enable safer, more diverse autonomous driving behaviors by overcoming mode collapse in RL training.
