ARBITER fixes AI reasoning failures with hidden patterns
New method recovers about 22% of the oracle headroom that majority voting leaves unclaimed on one math benchmark...
ARBITER is a model-agnostic method that identifies and corrects 'wrong-majority failures' in language model reasoning by modeling interactions between reasoning trajectory basins. On GSM8K with Qwen3-4B, majority-vote consensus over 24 samples reaches accuracy in the mid-94% range, and ARBITER recovers part of the remaining gap to a top-2 oracle. On MMLU-HS-Math with Llama-3.1-8B, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases.
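To make the "oracle headroom" figure concrete, here is a minimal sketch of how such a recovery fraction is typically computed: the gain over the baseline divided by the gap between the baseline and the oracle. The function name `headroom_recovered` and the numbers in the usage line are illustrative assumptions, not values or code from the ARBITER paper.

```python
def headroom_recovered(baseline: float, method: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle accuracy gap closed by a method.

    All three arguments are accuracies on the same scale (e.g. percent).
    This is an assumed, conventional definition of "headroom recovered".
    """
    headroom = oracle - baseline
    if headroom <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (method - baseline) / headroom


# Toy numbers chosen for easy arithmetic; NOT results from the paper.
print(round(headroom_recovered(baseline=70.0, method=76.0, oracle=90.0), 3))
```

Under this definition, a gain of a few accuracy points can correspond to a sizeable share of the headroom when the oracle ceiling sits well below 100%.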
- ARBITER identifies 'reasoning basins' (clusters of trajectories that converge to the same answer) and targets cases where majority voting fails by selecting the most stable basin rather than the most accurate one
- Raises accuracy by about 4 percentage points on MMLU-HS-Math (Llama-3.1-8B), recovering about 22% of oracle headroom, and closes part of the gap to a top-2 oracle on GSM8K (Qwen3-4B), using only the model's own outputs
- Model-agnostic and parameter-free in its simplest form (ARBITER-Δ), requiring no external training data or fine-tuning
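The terms in the list above can be illustrated with a small sketch. It groups sampled final answers into basins, takes the majority vote, and flags a wrong-majority failure. It does not implement ARBITER's correction step, which is not described here. The helper names, the reading of "top-2 oracle" as "the correct answer is in one of the two largest basins", and the sample answers are all assumptions made for illustration.

```python
from collections import Counter


def basins(answers):
    """Group sampled final answers into basins, largest basin first."""
    return Counter(answers).most_common()


def majority_vote(answers):
    """Return the answer of the largest basin."""
    return basins(answers)[0][0]


def top2_oracle_correct(answers, gold):
    """Assumed top-2 oracle: correct if gold is in one of the two largest basins."""
    top2 = [answer for answer, _ in basins(answers)[:2]]
    return gold in top2


def is_wrong_majority(answers, gold):
    """Majority vote is wrong although the top-2 oracle could still be right."""
    return majority_vote(answers) != gold and top2_oracle_correct(answers, gold)


# Toy example with 24 samples; the answers are invented for illustration.
samples = ["18"] * 13 + ["24"] * 9 + ["20"] * 2
print(basins(samples))
print(majority_vote(samples))
print(is_wrong_majority(samples, gold="24"))
```

In this toy case the largest basin holds a wrong answer while the second-largest holds the correct one, which is the kind of case where a method that chooses between basins has room to improve on plain voting.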
Why It Matters
ARBITER exposes and fixes a hidden flaw in how AI models reach consensus, enabling more reliable reasoning without costly retraining or additional data.