Safe Reinforcement Learning with Preference-based Constraint Inference
New AI training method uses a 'dead zone' mechanism to better capture complex safety rules from human feedback.
A team of researchers has introduced a novel method called Preference-based Constrained Reinforcement Learning (PbCRL) to tackle a core challenge in Safe Reinforcement Learning (RL): real-world safety constraints for AI agents are often complex, subjective, and difficult to program explicitly. Learning these constraints from human preferences is a practical alternative, but the Bradley-Terry models commonly used for preference learning fail to capture the 'asymmetric, heavy-tailed' nature of safety violations, leading to dangerous underestimation of risk.
PbCRL addresses this by incorporating two key innovations: a 'dead zone' mechanism in its preference model and a Signal-to-Noise Ratio (SNR) loss. The dead zone mechanism is theoretically proven to encourage the model to learn heavy-tailed cost distributions, which better aligns it with true safety requirements. The SNR loss encourages exploration by focusing on cost variances, which improves policy learning. The method also uses a two-stage training strategy to reduce the burden of collecting online human feedback while adaptively improving safety compliance. Empirical results show PbCRL achieves superior alignment with true safety constraints and outperforms existing state-of-the-art baselines in both safety and task performance metrics.
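The paper does not give the exact form of the dead-zone preference model, but the idea can be illustrated with a minimal sketch. Assume, hypothetically, that preferences between two trajectories are driven by their cumulative safety costs, and that cost differences smaller than a threshold `delta` fall inside the dead zone and are treated as indifference; only the excess beyond the dead zone feeds the Bradley-Terry logit. The names `bt_preference`, `deadzone_preference`, and `delta` are illustrative, not from the paper:

```python
import math

def bt_preference(cost_a, cost_b):
    # Standard Bradley-Terry model: the probability that trajectory a
    # is preferred over trajectory b, driven by the raw cost difference.
    return 1.0 / (1.0 + math.exp(-(cost_b - cost_a)))

def deadzone_preference(cost_a, cost_b, delta=1.0):
    # Hypothetical dead-zone variant: cost differences with magnitude
    # below delta are flattened to indifference (probability 0.5);
    # only the part of the difference beyond the dead zone drives
    # the preference logit.
    diff = cost_b - cost_a
    shrunk = math.copysign(max(abs(diff) - delta, 0.0), diff)
    return 1.0 / (1.0 + math.exp(-shrunk))
```

Under this toy model, a small cost gap (e.g. 0.3 with `delta=1.0`) yields exact indifference where the standard model would already express a preference; intuitively, concentrating probability mass at indifference for small differences forces large, rare violations to carry the signal, which is consistent with the heavy-tailed cost distributions the paper describes.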
- Introduces a 'dead zone' mechanism for preference modeling to capture heavy-tailed safety cost distributions, proven to improve constraint alignment.
- Outperforms state-of-the-art baselines in empirical tests, achieving better safety and reward performance for AI agents.
- Uses a two-stage training strategy to lower the need for continuous online human labeling while enhancing safety compliance.
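The summary above says the SNR loss "encourages exploration by focusing on cost variances" without giving its form. One way to make that concrete, purely as an assumption, is an exploration bonus that grows as the signal-to-noise ratio of a state's predicted costs shrinks, so high-variance (uncertain) states are visited more. The function `snr_exploration_bonus` and this formulation are illustrative, not the paper's loss:

```python
import statistics

def snr_exploration_bonus(cost_samples, eps=1e-8):
    # Hypothetical SNR-style term: treat the squared mean cost as the
    # "signal" and the cost variance as the "noise". A low SNR means
    # the cost estimate is uncertain relative to its magnitude, so
    # the state earns a larger exploration bonus.
    mean = statistics.fmean(cost_samples)
    var = statistics.pvariance(cost_samples)
    snr = (mean * mean) / (var + eps)
    return 1.0 / (1.0 + snr)
```

With this sketch, noisy cost samples such as `[0.0, 2.0]` receive a larger bonus than consistent ones such as `[1.0, 1.0]`, steering the policy toward states where the cost model is least certain.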
Why It Matters
Enables safer deployment of AI in critical real-world applications like autonomous vehicles and robotics by reliably inferring complex safety rules.