Agent Frameworks

Multi-agent AI oracles reach 83.43% accuracy in resolving prediction markets

Independent voting beats debate in test of 1,189 questions.

Deep Dive

A new paper on arXiv (2605.30802) by Tarun Kota tests whether multi-agent AI systems can improve the oracles that resolve prediction markets. Prediction markets depend on accurate outcome resolution, but current oracles trade off speed against reliability. Kota compared three single-LLM baselines (GPT-5 Nano, DeepSeek V3, Llama-3.3-70B) with two multi-agent architectures, independent aggregation with confidence-weighted voting and deliberative consensus, on 1,189 resolved questions from KalshiBench. All agents shared an evidence layer via Exa with date-filtered retrieval, so that differences in accuracy reflect reasoning rather than retrieval quality.
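To make confidence-weighted voting concrete, here is a minimal Python sketch. The summary does not say how the paper weights votes, so the scheme below (each agent adds its self-reported confidence to the tally of its chosen answer) is an assumption, and the function name and example numbers are hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(predictions):
    """Aggregate independent agent answers by summed confidence.

    predictions: list of (answer, confidence) pairs, one per agent,
    with confidence in [0, 1]. This weighting rule is an illustrative
    assumption, not necessarily the paper's.
    Returns (winning_answer, winner's share of total confidence).
    """
    if not predictions:
        raise ValueError("need at least one prediction")

    totals = defaultdict(float)
    for answer, confidence in predictions:
        totals[answer] += confidence

    winner = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    # Guard against the degenerate case where every confidence is 0.
    share = totals[winner] / total_weight if total_weight > 0 else 0.0
    return winner, share


# Hypothetical example: one confident "yes" against two moderate "no" votes.
answer, share = confidence_weighted_vote(
    [("yes", 0.9), ("no", 0.6), ("no", 0.55)]
)
print(answer, round(share, 2))  # no 0.56
```

The point of the independent design is that no agent sees another agent's answer before voting, so one confident mistake can be outvoted but cannot change anyone else's mind.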

Independent aggregation achieved the highest accuracy at 83.43%, outperforming the best single model by 1.01 percentage points. Deliberative consensus, by contrast, dropped accuracy to roughly 76%, below every single-model baseline, because confidently wrong models flipped the answers of correct ones and the errors propagated. The models' errors are also correlated (0.529–0.689), which explains why aggregation gains fall short of the theoretical Condorcet ceiling: that ceiling assumes voters err independently. The paper proposes a hybrid AI-human system in which only unanimous, high-confidence questions are auto-resolved, yielding 97.87% accuracy on 47% of the dataset, while the remaining questions are flagged for human review.
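For a sense of what the Condorcet ceiling means, the sketch below computes the accuracy of a simple majority vote among voters whose errors are fully independent. It assumes three voters and an illustrative per-voter accuracy of 0.80; neither figure comes from the paper, and the paper's confidence-weighted scheme is not a plain majority vote.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that a simple majority of n independent voters is
    correct on a binary question, when each voter is right with
    probability p. Requires odd n so ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")

    # Sum the binomial probabilities of k correct voters for every k
    # that forms a majority.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative values only: three independent voters, each 80% accurate.
print(round(majority_accuracy(0.8, 3), 3))  # 0.896
```

Under independence, three 80% voters would gain almost ten points from voting. When errors are correlated, as in the reported 0.529–0.689 range, the models tend to be wrong on the same questions, so there are fewer mistakes for the majority to outvote and the gain shrinks.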

Key Points
  • Independent aggregation with confidence-weighted voting scored 83.43% accuracy on 1,189 KalshiBench questions.
  • Deliberative consensus fell to ~76%, underperforming every single model due to error propagation.
  • A hybrid AI-human oracle reaches 97.87% accuracy on 47% of questions by auto-resolving only unanimous, high-confidence answers.
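The hybrid routing rule in the last point can be sketched in a few lines of Python. The 0.9 confidence threshold, the function name, and the return labels are hypothetical; the summary does not state the cutoff the paper uses.

```python
def route_question(predictions, min_confidence=0.9):
    """Decide whether a question can be auto-resolved.

    predictions: list of (answer, confidence) pairs, one per agent.
    min_confidence: hypothetical cutoff, not taken from the paper.
    Returns ("auto_resolve", answer) only when every agent gives the
    same answer and every confidence meets the cutoff; otherwise
    returns ("human_review", None).
    """
    if not predictions:
        return ("human_review", None)

    answers = {answer for answer, _ in predictions}
    unanimous = len(answers) == 1
    all_confident = all(conf >= min_confidence for _, conf in predictions)

    if unanimous and all_confident:
        return ("auto_resolve", predictions[0][0])
    return ("human_review", None)


# Hypothetical examples.
print(route_question([("yes", 0.95), ("yes", 0.92), ("yes", 0.97)]))
# ('auto_resolve', 'yes')
print(route_question([("yes", 0.95), ("no", 0.92), ("yes", 0.97)]))
# ('human_review', None)
```

A stricter threshold sends more questions to humans but should raise accuracy on the auto-resolved share; the reported 97.87% on 47% of questions is one point on that trade-off.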

Why It Matters

Reliable, cost-effective oracle systems could unlock wider adoption of prediction markets for decision-making.