An AI alignment research agenda based on asymmetric debate and monitoring.
New research agenda uses debate between AI agents during training to prevent reward hacking and ensure alignment.
AI safety researcher emanuelr has published a comprehensive alignment research agenda centered on asymmetric debate and monitoring, designed for scenarios where humans must build advanced AI systems within bounded timelines. Rather than pursuing perfect alignment, the approach targets "slightly-superhuman quantilizers": systems performing in the top 10-20% of aligned human capability, intended to accelerate alignment or longevity research. The pipeline has three components: interpretability monitoring, chain-of-thought monitoring (both used strictly for post-training evaluation rather than as training signals, to avoid teaching models to obfuscate), and, crucially, a training-only asymmetric debate protocol in which one agent proposes solutions while another critiques them.
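To make the protocol concrete, here is a minimal sketch of one debate training step in Python, assuming simple callable interfaces; the names (propose, critique, judge) and the zero-sum reward split are illustrative assumptions, not the agenda's actual implementation.

```python
# A minimal sketch of one training-only asymmetric debate step, assuming
# simple callable interfaces. These names and the zero-sum reward split
# are illustrative assumptions, not the author's actual implementation.
from typing import Callable

def debate_training_step(
    propose: Callable[[str], str],            # Agent A: task -> candidate solution
    critique: Callable[[str, str], str],      # Agent B: (task, solution) -> critique
    judge: Callable[[str, str, str], float],  # (task, solution, critique) -> score in [0, 1]
    task: str,
) -> tuple[float, float]:
    """Run one debate exchange and return (proposer_reward, critic_reward).

    Debate happens only during training: the judge's score becomes the
    reward signal. At deployment only the proposer runs, with monitors
    applied post hoc rather than in the loss."""
    solution = propose(task)
    attack = critique(task, solution)
    score = judge(task, solution, attack)
    # Zero-sum split: the proposer is rewarded for solutions that survive
    # critique; the critic is rewarded for exposing real flaws. This creates
    # optimization pressure against reward hacking that a lone consultant
    # answering to a judge would not face.
    return score, 1.0 - score
```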
Experimental validation comes from MNIST digit-classification tests in which the debate protocol recovers approximately 95% of gold-standard accuracy, significantly outperforming consultancy baselines at around 90%. Additional TicTacToe multi-agent reinforcement learning experiments examine training stability, comparing no-replay, mean-replay, and min-replay algorithms to address the maximin optimization challenge inherent in debate setups. The research also identifies key failure modes for monitors, including emergent introspective obfuscation capabilities and evaluation awareness, where models alter their behavior when monitored, highlighting the need for weight-based analysis alongside activation monitoring.
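One plausible reading of the replay comparison, sketched below, is as three choices of which frozen opponent checkpoints the current policy is scored against before each update; the evaluate function, the checkpoint pool, and the update rule are assumptions for illustration, not the experiments' code.

```python
# A hedged reading of the three replay schemes as choices of which frozen
# opponent checkpoints the current policy is scored against before each
# update. `evaluate` and the checkpoint pool are illustrative assumptions.
from typing import Callable, Sequence

def replay_reward(
    policy: object,
    opponent_pool: Sequence[object],               # frozen checkpoints, oldest first
    evaluate: Callable[[object, object], float],   # (policy, opponent) -> reward
    mode: str = "min_replay",
) -> float:
    """Reward used for the next policy update under each replay scheme.

    no_replay:   score only against the latest opponent (prone to cycling)
    mean_replay: average over the pool (smoother but optimistic)
    min_replay:  worst case over the pool, approximating the inner min of
                 the maximin objective max_pi min_opp R(pi, opp)
    """
    if mode == "no_replay":
        return evaluate(policy, opponent_pool[-1])
    rewards = [evaluate(policy, opp) for opp in opponent_pool]
    if mode == "mean_replay":
        return sum(rewards) / len(rewards)
    if mode == "min_replay":
        return min(rewards)
    raise ValueError(f"unknown mode: {mode}")
```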
The agenda operates under explicit assumptions of human cooperation and urgent timelines, reframing alignment as staying sufficiently close to human priors while achieving useful capability gains. The work contributes practical methods for creating optimization pressure that discourages reward hacking, moving beyond consultancy setups in which the AI simply learns to satisfy its judge rather than pursue the genuine objective.
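The quantilizer framing has a standard formal reading that makes "close to human priors" precise: sample from the top q-fraction of a human-like base distribution rather than arg-maxing a utility, which bounds the density ratio against the base policy by 1/q. The toy implementation below sketches that standard construction, not the agenda's code; base_sampler and utility are hypothetical stand-ins.

```python
# A toy version of the standard quantilizer construction. `base_sampler`
# and `utility` are hypothetical; the agenda's "top 10-20% of aligned
# human capability" corresponds roughly to q between 0.1 and 0.2.
import random

def quantilize(base_sampler, utility, q: float = 0.1, n: int = 1000):
    """Draw n candidate actions from the human-prior base distribution,
    keep the top q-fraction by utility, and return one uniformly.

    Every returned action has real probability mass under the base
    policy, so the quantilizer's density ratio against the human prior
    is bounded by 1/q. That bound is the formal sense in which it stays
    close to human priors while still gaining capability."""
    candidates = [base_sampler() for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return random.choice(top)
```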
- Proposes a training-only asymmetric debate protocol where Agent A proposes solutions and Agent B critiques them, recovering ~95% of gold-standard accuracy in MNIST tests vs ~90% for consultancy baselines
- Targets "slightly-superhuman quantilizers" performing in top 10-20% of aligned human capability to accelerate alignment/longevity research under bounded timelines
- Uses interpretability and CoT monitoring for post-training evaluation only, identifying three key failure modes including evaluation awareness and weight-encoded misalignment (see the monitoring sketch after this list)
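As a concrete illustration of the "evaluation only, never a training signal" stance, here is a minimal sketch of a post-hoc chain-of-thought audit; the keyword heuristic is a deliberately crude placeholder for a learned monitor or probe, not the agenda's method.

```python
# A deliberately crude sketch of post-training-only CoT monitoring. The
# keyword heuristic stands in for a learned monitor; the design point is
# that flags gate evaluation and deployment but never enter the training
# loss, so the model is not optimized to evade the monitor.
SUSPECT_PATTERNS = ("fool the grader", "game the reward", "hide this step")

def audit_chain_of_thought(transcripts: list[str]) -> list[int]:
    """Return indices of transcripts whose reasoning looks reward-hacky.

    Runs after training is finished. Note the failure modes flagged in
    the agenda: an evaluation-aware model may reason differently when it
    suspects monitoring, and misalignment can live in the weights rather
    than the visible chain of thought, which is why weight-based
    analysis is needed alongside transcript and activation monitors."""
    return [
        i for i, cot in enumerate(transcripts)
        if any(pattern in cot.lower() for pattern in SUSPECT_PATTERNS)
    ]
```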
Why It Matters
Provides concrete training methods to prevent reward hacking in advanced AI systems, moving theoretical alignment research toward implementable protocols.