Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added
An adversarial debate benchmark reveals surprising performance shifts among top LLMs.
The LLM Debate Benchmark has been updated with 10 new model entries, including GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning. The benchmark uses adversarial, multi-turn debates across 683 curated motions, with each pair debating twice (sides swapped). Scores are Bradley-Terry ratings on an Elo-like scale centered at 1500. A three-model panel judges each debate, with a mean cross-judge winner agreement of 0.55 on overlapping side-swapped matchups.
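For intuition on the scoring, here is a minimal Python sketch of the Bradley-Terry setup described above. The match data, model names, and the 400-point Elo-style scale factor are illustrative assumptions, not the benchmark's actual pipeline; the sketch fits strengths with the standard MM iteration and maps them onto a 1500-centered scale.

```python
import math
from collections import defaultdict

# Hypothetical pairwise results: (model_a, model_b, wins_for_a, wins_for_b).
# In the benchmark each pair debates with sides swapped; the counts here are made up.
matches = [
    ("model_x", "model_y", 14, 6),
    ("model_x", "model_z", 11, 9),
    ("model_y", "model_z", 8, 12),
]

def fit_bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update."""
    models = {m for a, b, _, _ in matches for m in (a, b)}
    strength = {m: 1.0 for m in models}
    wins = defaultdict(float)  # total wins per model
    for a, b, wa, wb in matches:
        wins[a] += wa
        wins[b] += wb
    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for a, b, wa, wb in matches:
                if m in (a, b):
                    other = b if m == a else a
                    denom += (wa + wb) / (strength[m] + strength[other])
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        # Normalize so the geometric mean of strengths stays at 1 (fixes the scale).
        gm = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {m: v / gm for m, v in new.items()}
    return strength

def to_elo_like(strength, center=1500.0, scale=400.0):
    """Map BT strengths onto an Elo-like scale centered at `center`.
    The 400-point logistic scale is an assumption borrowed from standard Elo."""
    return {m: center + scale * math.log10(s) for m, s in strength.items()}

ratings = to_elo_like(fit_bradley_terry(matches))
for m, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {r:.0f}")
```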
Notable results: Opus 4.7 remains the leader at 1711 BT. GPT-5.5 (high reasoning) enters at 1574, well below GPT-5.4 (high) at 1625. Grok 4.3 regresses sharply, dropping from 1512 (Grok 4.20 Beta 0309) to 1419. Other models improve: GLM-5.1 climbs from 1536 to 1573, Kimi K2.6 from 1520 to 1568, DeepSeek V4 Pro from 1438 to 1517, and Xiaomi MiMo V2.5 Pro from 1459 to 1553. Mistral Medium 3.5 High Reasoning enters at 1412, ahead of Mistral Large 3 (1299). Tencent Hy3 Preview lands at 1481. These results highlight which models excel in structured debate and adversarial reasoning, a useful proxy for logical consistency and persuasion.
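To read the rating gaps, a Bradley-Terry or Elo-style difference translates into an expected head-to-head win probability. The sketch below assumes the common 400-point logistic scale, which this post does not confirm for the benchmark.

```python
def expected_win_prob(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A beats B under a logistic (Elo-style) model.
    The 400-point scale is an assumption; the benchmark may use a different one."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Example with scores from this update: Opus 4.7 (1711) vs GPT-5.5 high reasoning (1574).
print(f"{expected_win_prob(1711, 1574):.2f}")  # ~0.69 under the assumed 400-point scale
```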
- Opus 4.7 maintains the top spot with a Bradley-Terry score of 1711, far ahead of the pack.
- GPT-5.5 (high reasoning) scores 1574, underperforming GPT-5.4 (high) at 1625.
- Grok 4.3 drops from 1512 (Grok 4.20 Beta 0309) to 1419, a notable regression.
Why It Matters
Debate benchmarks show how models reason and persuade under adversarial pressure, capabilities that matter for AI alignment and real-world persuasion tasks.