Predicting Program Correctness By Ensemble Semantic Entropy
A new 'Ensemble Semantic Entropy' technique slashes false positives and cuts compute costs by nearly 65%.
A research team led by Yunxiang Wei has introduced a novel technique called Ensemble Semantic Entropy (ESE) to tackle a critical flaw in AI code generation. While large language models (LLMs) like GPT-4 and Claude can write programs, verifying their correctness without running them is difficult. Current methods rely on a single model's consistency, but a model can be consistently wrong. ESE solves this by measuring the semantic agreement across an ensemble of different models, creating a more reliable proxy for program correctness.
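The core idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes candidate programs sampled from several models can be clustered by behavioral equivalence (identical outputs on a set of probe inputs), with Shannon entropy over the resulting clusters serving as the uncertainty signal. The `run` helper and probe-input scheme are assumptions for the sketch.

```python
import math
from collections import Counter

def semantic_entropy(candidates, probe_inputs, run):
    """Entropy over semantic-equivalence classes of candidate programs.

    candidates:   program strings sampled from an ensemble of models.
    probe_inputs: inputs used to probe each program's behavior.
    run(prog, x): executes a candidate program on input x (assumed helper).

    Programs producing identical outputs on every probe input are treated
    as semantically equivalent -- a behavioral proxy for equivalence.
    """
    # Each program's "signature" is its tuple of outputs on the probes.
    signatures = [
        tuple(run(prog, x) for x in probe_inputs) for prog in candidates
    ]
    counts = Counter(signatures)
    n = len(candidates)
    # Shannon entropy of the empirical cluster distribution: low entropy
    # means the ensemble agrees semantically, suggesting correctness.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

If all models converge on semantically identical programs, the entropy is 0; disagreement between models pushes it up, flagging the output as less trustworthy.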
Experiments on the LiveCodeBench benchmark show that ESE correlates more strongly with actual correctness than single-model approaches. Crucially, in selective generation tasks with tight false-positive rate constraints, ESE improves prediction accuracy by 53.4%. The team also built a practical application called 'Cas,' a cascading test-time scaling framework. By using ESE's confidence signal to decide when a simpler model suffices and when a more complex one is needed, Cas maintains high performance while reducing computational cost (measured in FLOPs) by 64.9% compared to scaling up a single model.
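The cascading idea can be illustrated with a short sketch. All names here are hypothetical, not from the paper: `tiers` is a list of (name, generator, FLOPs-cost) triples ordered cheap to expensive, and `entropy_fn` scores a batch of candidates (e.g., with an ensemble semantic-entropy measure). The cascade accepts a cheap tier's answer only when the entropy signal indicates agreement.

```python
def cascade_generate(task, tiers, entropy_fn, threshold=0.5):
    """Route a task through model tiers, escalating only on high entropy.

    task:       the coding task to solve.
    tiers:      list of (name, generate_fn, flops_cost), cheapest first;
                generate_fn(task) returns a list of candidate programs.
    entropy_fn: scores a candidate batch; low score = strong agreement.
    threshold:  entropy below this accepts the current tier's answer.

    Returns (chosen_candidate, total_flops_spent).
    """
    total_flops = 0
    best = None
    for name, generate, flops in tiers:
        candidates = generate(task)
        total_flops += flops
        best = candidates[0]
        if entropy_fn(candidates) <= threshold:
            break  # the ensemble agrees: accept without escalating
        # otherwise fall through to the next, stronger (costlier) tier
    return best, total_flops
```

The compute savings come from the early exits: most tasks never reach the expensive tiers, so the average FLOPs per task stays close to the cheap tier's cost.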
This work provides a new, efficient perspective on balancing model parameter size with inference-time scaling. It moves beyond simply making a single model larger and instead focuses on intelligently orchestrating multiple models based on the certainty of their output. The method offers a direct path to making AI coding assistants more reliable and cost-effective for developers and enterprises.
- ESE aggregates outputs from multiple AI models to better predict code correctness, improving accuracy by 53.4% in constrained tasks.
- The derived 'Cas' framework uses ESE to dynamically route tasks, cutting computational costs (FLOPs) by 64.9% versus single-model scaling.
- The research addresses a key LLM weakness: a single model can be confidently incorrect, making traditional consistency checks unreliable.
Why It Matters
This makes AI coding assistants more trustworthy and cheaper to run, directly impacting developer productivity and cloud costs.