Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
New algorithm lets LLMs optimize outputs by comparing their own responses, achieving 20-percentage-point accuracy gains on math benchmarks.
A research team from Columbia University, Princeton University, and other institutions has introduced Duel-Evolve, a novel evolutionary algorithm that lets large language models (LLMs) optimize their own outputs at test time without requiring external reward models or ground-truth labels. The key innovation addresses a fundamental limitation of current test-time optimization methods: many real-world tasks lack reliable, calibrated scoring functions. Instead of depending on external evaluators, Duel-Evolve leverages the LLM's own ability to compare candidate responses through pairwise preferences, which models can often produce more reliably than absolute scores. The approach opens up test-time optimization in domains where reward signals are sparse, noisy, or unavailable.
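To make the self-preference signal concrete, here is a minimal sketch of a pairwise comparison query in Python. The prompt wording, the `query_llm` helper, and the answer parsing are illustrative assumptions, not the paper's actual implementation.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the underlying LLM."""
    raise NotImplementedError

def prefer(task: str, candidate_a: str, candidate_b: str) -> bool:
    """Ask the model which of two candidate responses answers `task` better.

    Returns True if candidate A is preferred, False if candidate B is.
    """
    prompt = (
        f"Task:\n{task}\n\n"
        f"Response A:\n{candidate_a}\n\n"
        f"Response B:\n{candidate_b}\n\n"
        "Which response answers the task better? Reply with exactly 'A' or 'B'."
    )
    verdict = query_llm(prompt).strip().upper()
    return verdict.startswith("A")
```

In practice one would likely also randomize the A/B presentation order across calls to counter the position bias that LLM judges are known to exhibit.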
The technical core of Duel-Evolve combines a Bayesian Bradley-Terry model, which aggregates noisy preference comparisons into quality estimates, with Double Thompson Sampling, which allocates the comparison budget toward promising candidates. The algorithm runs an evolutionary loop: high-quality 'parent' responses are selected based on these preference-derived quality estimates and used to generate improved 'offspring' candidates. In evaluations, Duel-Evolve posted substantial gains: accuracy improvements of 20 percentage points over existing methods on the MathBench benchmark and of more than 12 percentage points on LiveCodeBench. This demonstrates that LLMs can supply strong optimization signals through self-comparison alone, potentially reducing dependence on expensive human feedback or carefully engineered reward functions for test-time scaling.
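The sketch below shows one way these two pieces could fit together: a Bayesian Bradley-Terry fit via a MAP estimate with a diagonal Laplace approximation, and a next-duel choice drawn from two independent posterior samples in the spirit of Double Thompson Sampling. Everything here is a simplification under assumed details; the paper's exact inference and sampling procedures may differ.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, prior_var: float = 1.0,
                      iters: int = 200, lr: float = 0.1):
    """Bayesian Bradley-Terry fit: MAP log-strengths plus a diagonal
    Laplace approximation of the posterior variance.

    wins[i, j] = number of duels in which candidate i beat candidate j.
    Model: P(i beats j) = sigmoid(theta_i - theta_j), theta ~ N(0, prior_var).
    """
    n = wins.shape[0]
    theta = np.zeros(n)
    for _ in range(iters):
        # p[i, j] = sigmoid(theta_i - theta_j)
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
        # Gradient of the log-posterior with respect to theta.
        grad = (wins - (wins + wins.T) * p).sum(axis=1) - theta / prior_var
        theta += lr * grad
    p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
    # Diagonal of the negative Hessian -> approximate posterior precision.
    precision = ((wins + wins.T) * p * (1.0 - p)).sum(axis=1) + 1.0 / prior_var
    return theta, 1.0 / precision  # posterior mean and diagonal variance

def select_duel(theta: np.ndarray, var: np.ndarray,
                rng: np.random.Generator) -> tuple[int, int]:
    """Pick the next pair to compare from two independent posterior samples,
    a simplified take on Double Thompson Sampling."""
    first = int(np.argmax(rng.normal(theta, np.sqrt(var))))
    challenger_sample = rng.normal(theta, np.sqrt(var))
    challenger_sample[first] = -np.inf  # force a distinct challenger
    return first, int(np.argmax(challenger_sample))
```

The two-sample selection concentrates duels on candidates whose quality estimates are both high and uncertain, which is what lets a small comparison budget go a long way.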
- Achieves 20-percentage-point accuracy gains on MathBench without external rewards
- Combines a Bayesian Bradley-Terry model (preference aggregation) with Double Thompson Sampling (comparison-budget allocation)
- Improves over comparable methods by 12+ percentage points on LiveCodeBench coding tasks
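Putting the pieces together, the loop below is a hedged sketch of the evolutionary outer layer: run a budgeted tournament of self-preference duels, rank candidates by the fitted Bradley-Terry strengths, and prompt the model to produce an improved 'offspring' from the top-ranked 'parents'. It reuses `prefer`, `fit_bradley_terry`, and `select_duel` from the sketches above; the `generate` helper, the mutation prompt, and the top-2 parent-selection rule are assumptions, not the paper's exact operators.

```python
import numpy as np

def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling a new response from the LLM."""
    raise NotImplementedError

def duel_evolve_sketch(task: str, population: list[str],
                       generations: int = 3, duels_per_gen: int = 30,
                       seed: int = 0) -> str:
    """Illustrative outer loop; assumes at least two initial candidates."""
    rng = np.random.default_rng(seed)
    best = population[0]
    for _ in range(generations):
        n = len(population)
        wins = np.zeros((n, n))
        for _ in range(duels_per_gen):
            theta, var = fit_bradley_terry(wins)
            i, j = select_duel(theta, var, rng)
            if prefer(task, population[i], population[j]):
                wins[i, j] += 1
            else:
                wins[j, i] += 1
        theta, _ = fit_bradley_terry(wins)
        ranked = np.argsort(theta)            # ascending strength
        best = population[ranked[-1]]
        parent_a, parent_b = (population[k] for k in ranked[-2:])
        # "Offspring": ask the model to improve on the strongest candidates.
        population = population + [generate(
            f"Task:\n{task}\n\n"
            f"Draft 1:\n{parent_a}\n\nDraft 2:\n{parent_b}\n\n"
            "Write a single improved answer that fixes any remaining errors."
        )]
    return best
```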
Why It Matters
Enables LLMs to self-improve in production without costly reward models or human feedback, expanding real-world applicability.