Developer Tools

Execution-based code selection beats text voting by up to 52 percentage points

Run the code instead of just voting on it: new research shows behavioral evidence dominates.

Deep Dive

A new paper by Shan Jiang, Zijian Yi, and Chenguang Zhu, titled "Semantic Voting: Execution-Grounded Consensus for LLM Code Generation," systematically compares candidate-selection methods across 18 configurations spanning different models, thinking levels, and benchmarks. The study pits traditional output-pattern majority voting against execution-based methods such as weighted voting, MBR-Exec, and SemanticVote, which clusters candidates by their execution fingerprints on LLM-generated inputs. The results are striking: every execution-based selector outperforms majority voting by at least 18 percentage points, and the best selector gains 19–52 pp across all configurations.
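To make the contrast concrete, here is a minimal sketch of the two selection styles described above. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`run_candidate`, `text_majority_vote`, `semantic_vote`), the toy candidates, and the hand-picked test inputs are all hypothetical, and the inputs stand in for the LLM-generated ones the paper uses.

```python
from collections import Counter, defaultdict


def run_candidate(source, func_name, inputs):
    """Run one candidate on every input and return its fingerprint (a tuple of outputs)."""
    namespace = {}
    try:
        exec(source, namespace)  # Unsandboxed: acceptable only for a toy illustration.
        func = namespace[func_name]
    except Exception:
        return ("<load-error>",) * len(inputs)
    outputs = []
    for x in inputs:
        try:
            outputs.append(repr(func(x)))
        except Exception as exc:
            outputs.append(f"<error:{type(exc).__name__}>")
    return tuple(outputs)


def text_majority_vote(candidates):
    """Pick the most frequent candidate by exact (whitespace-stripped) source text."""
    counts = Counter(c.strip() for c in candidates)
    return counts.most_common(1)[0][0]


def semantic_vote(candidates, func_name, inputs):
    """Group candidates by identical fingerprints and return one from the largest group."""
    clusters = defaultdict(list)
    for src in candidates:
        clusters[run_candidate(src, func_name, inputs)].append(src)
    largest = max(clusters.values(), key=len)
    return largest[0]


# Hypothetical task: implement absolute value as f(x).
buggy = "def f(x):\n    return x"
candidates = [
    "def f(x):\n    return x if x >= 0 else -x",  # correct
    "def f(x):\n    return abs(x)",               # correct, different text
    "def f(x):\n    return max(x, -x)",           # correct, different text
    buggy,                                         # wrong, but repeated verbatim
    buggy,
]
test_inputs = [-3, 0, 5]  # stand-in for generated inputs

by_text = text_majority_vote(candidates)
by_behavior = semantic_vote(candidates, "f", test_inputs)
```

In this toy setup, text voting picks the buggy candidate because it is the only one repeated verbatim, while the three correct candidates share a fingerprint and form the largest behavioral cluster despite being written differently.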

Perhaps more surprising, once candidates are executed on diverse inputs, the choice of aggregation rule (SemanticVote, weighted voting, or MBR-Exec) makes no statistically significant difference. The critical factor is input quality: sketch-based input generation outperforms direct LLM generation by 0.6–2.1 pp and random fuzzing by up to 11.3 pp. Deeper thinking improves majority voting by 12 pp but leaves execution-based methods flat or even degraded, because candidate diversity falls. The authors frame the problem as one of signal quality rather than aggregation rule: when no oracle exists, behavioral evidence from running code matters more than how you combine votes.
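The sketch below shows how two different aggregation rules can land on the same candidate once they share the same execution fingerprints. It is a hedged illustration, not the paper's code: the fingerprints are made up, `plurality_by_fingerprint` is a hypothetical stand-in for cluster-based voting, and `mbr_exec` uses a simple per-input mismatch count as its loss, which is an assumption about how MBR-Exec might be instantiated here.

```python
from collections import Counter


def plurality_by_fingerprint(fingerprints):
    """Return the index of the first candidate in the most common fingerprint group."""
    counts = Counter(fingerprints)
    best = counts.most_common(1)[0][0]
    return fingerprints.index(best)


def mbr_exec(fingerprints):
    """Return the index of the candidate with the lowest total disagreement with all others."""

    def loss(a, b):
        # Number of inputs on which two candidates produced different outputs.
        return sum(x != y for x, y in zip(a, b))

    risks = [sum(loss(fp, other) for other in fingerprints) for fp in fingerprints]
    return risks.index(min(risks))


# Made-up fingerprints: outputs of five candidates on three shared inputs.
fingerprints = [
    ("3", "0", "5"),
    ("3", "0", "5"),
    ("3", "0", "5"),
    ("-3", "0", "5"),
    ("-3", "0", "5"),
]

picked_by_plurality = plurality_by_fingerprint(fingerprints)
picked_by_mbr = mbr_exec(fingerprints)
```

Both rules consume the same fingerprints, so what they can distinguish is limited by the inputs that produced them. If the inputs never exercised negative numbers here, all five fingerprints would be identical and neither rule could separate the correct candidates from the buggy ones.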

Key Points
  • The best execution-based selector outperforms majority voting by 19–52 percentage points across all 18 configurations tested; every execution-based selector gains at least 18 pp.
  • Aggregation rule (SemanticVote, weighted voting, MBR-Exec) had no statistically significant effect; input quality drove performance.
  • Sketch-based input generation improved results by up to 11.3 pp over random fuzzing; deeper thinking helped majority voting but hurt execution methods due to reduced diversity.

Why It Matters

For reliable LLM code generation, run candidates on diverse inputs rather than just voting on outputs.