LLM Teams Boost Quiz Accuracy by Up to 20 Points in New Study
Teaming up LLMs raised quiz accuracy by up to 20 percentage points over single models.
A new paper by Kotelnikova et al. (accepted at Dialogue-2026) explores whether LLM teams outperform single models on complex reasoning tasks. Using six recent open-source LLMs, the authors assembled teams to answer questions from the Russian quiz game 'What? Where? When?' (ChGK), which demands indirect reasoning and cultural knowledge. They designed three interaction strategies: Voting (majority rule), Silent Team (the captain sees only the members' final answers), and Talkative Team (the captain sees both answers and rationales). On a dataset of 572 questions from 2025, team-based approaches consistently beat single models, with accuracy gains of up to 20 percentage points. The best team achieved 44.23% accuracy, approaching human teams on the questions for which human statistics were available. The study also found that inter-model disagreement strongly predicted lower accuracy, but explanatory communication (the Talkative strategy) substantially mitigated those performance drops.
Further analysis of captain behavior revealed no self-preference bias: captains did not favor their own initial answers over those of their peers. Access to peer reasoning improved captain judgments, suggesting that LLM teams act primarily as answer-selection and error-filtering mechanisms rather than as generators of novel solutions. The authors argue that adaptive strategies, in which the interaction style changes with task difficulty, are a promising direction for multi-agent systems. The research indicates that collaboration and communication, not just scale or parameter count, can significantly boost LLM reasoning in complex, knowledge-intensive domains.
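To make the three strategies concrete, here is a minimal Python sketch of how they differ. It is not the authors' implementation: the `MemberResponse` type, the `normalize` helper, the tie-breaking rule, and the captain context format are all assumptions made for illustration, since the summary above does not describe the paper's prompts or answer matching.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemberResponse:
    model: str      # hypothetical model label, not a name from the paper
    answer: str     # the member's final answer
    rationale: str  # the member's explanation for that answer


def normalize(answer: str) -> str:
    """Assumed answer matching: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def majority_vote(responses: list[MemberResponse]) -> str:
    """Voting: return the most frequent normalized answer.

    Ties go to the answer that appeared first (an assumption).
    """
    counts = Counter(normalize(r.answer) for r in responses)
    return counts.most_common(1)[0][0]


def captain_context(responses: list[MemberResponse], talkative: bool) -> str:
    """Build the text a captain model would see.

    Silent Team: answers only. Talkative Team: answers plus rationales.
    """
    lines = []
    for r in responses:
        line = f"{r.model}: {r.answer}"
        if talkative:
            line += f" (rationale: {r.rationale})"
        lines.append(line)
    return "\n".join(lines)
```

In this sketch the only difference between the Silent and Talkative strategies is whether rationales reach the captain, which mirrors the distinction the study reports; how the captain model is then prompted to choose a final answer is left out.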
- Three team strategies tested: Voting, Silent Team (captain sees answers), and Talkative Team (captain sees answers + rationales).
- Best team achieved 44.23% accuracy on 572 ChGK questions; team approaches gained up to 20 percentage points over single-model baselines.
- Explanatory communication between models reduced accuracy drops caused by inter-model disagreement.
Why It Matters
The study shows that multi-agent collaboration with explanatory communication can raise LLM accuracy on complex reasoning tasks by up to 20 percentage points.