AI Debate Research Finds Transcripts Harm Performance in Weak-Judge/Strong-Debater Tests

LessWrong AI February 27, 2026

⚡ New study finds that showing judges AI debate transcripts does no better, and often worse, than showing them only the proposed answers.

Deep Dive

Researchers Ethan Elasky and Frank Nakasako have published an empirical study of inference-time generative debate as a method for scalable AI oversight. They use weak-judge/strong-debater setups with GPT-5-mini/nano and Qwen3-8B/4B model pairs on BigCodeBench+ coding tasks and ARC-AGI reasoning problems, and the results are mostly negative. Debate underperformed the consultancy baseline in 11 of 16 tested conditions, which challenges the assumption that debate helps a weaker judge oversee stronger models. More striking still, removing the transcripts entirely and showing judges only the proposed answers matched or outperformed both debate and consultancy.
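
As a rough illustration of the three judge-input conditions described above, the sketch below assembles what a judge might see under debate, consultancy, and answers-only. The `Format` enum, the `build_judge_input` function, and the prompt wording are hypothetical and not taken from the study; its actual prompts and protocol may differ.

```python
from enum import Enum


class Format(Enum):
    DEBATE = "debate"              # two debaters, judge sees both speeches
    CONSULTANCY = "consultancy"    # one consultant, judge sees one speech
    ANSWERS_ONLY = "answers_only"  # no transcript at all


def build_judge_input(question, answers, speeches, fmt):
    """Assemble the text shown to the judge (hypothetical layout).

    answers:  proposed answers, one per participant
    speeches: argument text, aligned with `answers`
    """
    lines = [f"Question: {question}"]
    for i, answer in enumerate(answers, start=1):
        lines.append(f"Proposed answer {i}: {answer}")

    # Decide how much of the transcript the judge is allowed to see.
    if fmt is Format.DEBATE:
        shown = speeches
    elif fmt is Format.CONSULTANCY:
        shown = speeches[:1]
    else:
        shown = []

    for i, speech in enumerate(shown, start=1):
        lines.append(f"Argument {i}: {speech}")
    return "\n".join(lines)
```

In this sketch the answers-only condition differs from debate only in that the argument lines are dropped, which is the comparison the study reports as matching or beating the transcript-based formats.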

The study points to a mechanism: judges tend to endorse plausible-sounding arguments even when both debaters are wrong, and debate transcripts amplify this tendency. Best-of-4 speech selection showed some promise on ARC-AGI tasks, which suggests RL-trained debaters might do better, but overall the results indicate that current debate implementations may not deliver the expected oversight benefits. Unlike earlier work in multiple-choice settings, the study uses tasks with verifiable ground-truth answers and introduces formats in which participants freely choose positions rather than being assigned sides, both important considerations for alignment research.
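
Best-of-4 speech selection can be sketched as sampling four candidate speeches and keeping the highest-scoring one. This summary does not say how the study scored candidates, so `generate` and `score` below are hypothetical callables standing in for a debater model and whatever preference signal is used.

```python
from typing import Callable


def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int = 4) -> str:
    """Sample n candidate speeches and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    # max() returns the first candidate with the top score if there is a tie.
    return max(candidates, key=score)
```

With toy stand-ins, such as a generator that yields fixed strings and `len` as the score, the function returns the longest string; in a real setup both callables would wrap model calls.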

Key Points

Debate underperformed consultancy in 11 of 16 conditions across GPT-5 and Qwen model pairs
Removing debate transcripts entirely matched or outperformed both debate and consultancy formats
Judges default to endorsing plausible-sounding arguments when both participants are wrong

Why It Matters

Challenges fundamental assumptions about AI debate for scalable oversight, suggesting simpler approaches may work better.
