AI Safety

Palaestra Research finds AI debate boosts reward labeling by 16 points under right conditions

LessWrong AI May 26, 2026

⚡Debate improved macro-F1 by 16 points when critic outperformed judge and judges verified instead of deferring.

Deep Dive

A new study from Palaestra Research and an independent researcher tackles a core alignment question: when a weaker judge evaluates a stronger model's output, does debate between two copies of the stronger model help the judge make better decisions? The team (Ethan Elasky, Frank Nakasako, and Naman Goyal) ran controlled experiments on code and ARC-style logic tasks whose answers could be verified programmatically. They compared a debate condition, in which a proposer is paired with a critic who can agree or disagree and give reasons, against one-sided consultancy, in which the proposer defends its answer alone. The goal was to study the mechanism before tackling harder domains such as research proposals or long-horizon agentic work.
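The two conditions can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' code: the `Model` type, the prompt wording, the `Transcript` class, and the rule that a judge verdict starting with "correct" counts as a positive reward label are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function from a prompt to a reply.
Model = Callable[[str], str]


@dataclass
class Transcript:
    task: str
    answer: str
    messages: List[str]


def consultancy(task: str, proposer: Model) -> Transcript:
    """One-sided condition: the proposer answers and defends its answer alone."""
    answer = proposer(f"Solve: {task}")
    defense = proposer(f"Defend your answer '{answer}' to: {task}")
    return Transcript(task, answer, [f"PROPOSER: {defense}"])


def debate(task: str, proposer: Model, critic: Model) -> Transcript:
    """Debate condition: a critic may agree or disagree and must give reasons."""
    answer = proposer(f"Solve: {task}")
    defense = proposer(f"Defend your answer '{answer}' to: {task}")
    critique = critic(
        f"Agree or disagree with answer '{answer}' to: {task}. Give reasons."
    )
    return Transcript(
        task, answer, [f"PROPOSER: {defense}", f"CRITIC: {critique}"]
    )


def judge_label(transcript: Transcript, judge: Model) -> bool:
    """Reward label: True if the judge accepts the answer (assumed convention)."""
    prompt = "\n".join([transcript.task, transcript.answer, *transcript.messages])
    return judge(prompt).strip().lower().startswith("correct")
```

In this sketch the only difference between the two conditions is whether the judge's prompt contains a critic message, which is what lets the effect of the critic be isolated.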

The results were nuanced. Debate significantly improved reward labels in three of the five pairings in which a stronger model was evaluated by a weaker one: Opus 4.6 judged by Opus 4.5 (+16 points macro-F1), Gemini 3.1 Pro judged by Gemini 3 Flash (+14 points), and Qwen3.5-122B judged by Qwen3.5-35B (+14 points). The other two pairings showed no benefit. The critical factor was not the generator-verifier gap (all debaters were better at verifying than generating) but whether the critic was a better classifier than the judge, and whether the judge treated criticism as a claim to verify rather than testimony to accept. Most of the gain came from the critic's first message; rebuttal rounds barely changed outcomes. The authors flag the implication for alignment: debate helps mainly by reducing false positives (rewarding bad answers), which is the most dangerous error when training policies.
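For readers unfamiliar with the metric, macro-F1 averages the F1 score of each class (here "answer is correct" and "answer is incorrect") with equal weight, so a judge that approves everything scores poorly even when most answers are right. The sketch below computes macro-F1 and the false-positive rate for binary reward labels. The function names and the toy labels are assumptions made for illustration and are not data from the study.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[bool], y_pred: Sequence[bool], cls: bool) -> float:
    """F1 for one class, treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Unweighted mean of the per-class F1 scores over both classes."""
    return (f1_for_class(y_true, y_pred, True) + f1_for_class(y_true, y_pred, False)) / 2


def false_positive_rate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Share of truly bad answers that the judge rewarded anyway."""
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    negatives = sum(not t for t in y_true)
    return fp / negatives if negatives else 0.0


# Toy labels (made up): True means the answer is actually correct.
y_true = [True, True, False, False]
lenient_judge = [True, True, True, True]   # accepts every answer
careful_judge = [True, True, False, True]  # rejects one of the bad answers

for name, y_pred in [("lenient", lenient_judge), ("careful", careful_judge)]:
    print(name, round(macro_f1(y_true, y_pred), 3), false_positive_rate(y_true, y_pred))
```

In the toy data, rejecting a single bad answer lowers the false-positive rate and raises macro-F1 together, which is the kind of improvement the article attributes to debate.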

Key Points

Debate improved macro-F1 by 16 points for Opus 4.6 judged by Opus 4.5, and by 14 points each for Gemini 3.1 Pro judged by Gemini 3 Flash and Qwen3.5-122B judged by Qwen3.5-35B.
Help only occurred when the critic was a better classifier than the judge and the judge used criticism as verification, not testimony.
Most of the test-time benefit came from the critic's first message; additional rebuttal rounds had negligible effect on macro-F1.

Why It Matters

Shows that debate can improve the reward labels used in alignment training, but only when the critic is a better classifier than the judge and the judge verifies criticism rather than deferring to it.
