SpecRef hybrid decoding boosts code accuracy 20% by fixing structure
Training-free hybrid model reveals code benchmarks measure structure, not logic
A new paper on arXiv (ID: 2606.27474) introduces SpecRef, a training-free hybrid decoding strategy that blends autoregressive (AR) drafts with masked diffusion models using entropy-guided selective masking. The goal: improve generation quality without additional training. SpecRef was evaluated on six benchmarks (HumanEval, MBPP, GSM8K, BBH, ARC-Challenge, HellaSwag) using three protocols (execution-based pass@1, exact-match, log-likelihood). Key findings reveal that code benchmarks conflate structural discovery with logical correctness—providing a syntactic scaffold lifted accuracy from near zero to over 20% without changing the underlying model. This suggests many baseline failures are due to structure, not reasoning ability.
The paper also uncovers a "refinement tension" phenomenon: multi-stage correction degrades already-correct tokens, exposing hidden saturation ceilings in benchmarks that single-model evaluations miss. Additionally, log-likelihood and generative evaluation produce different model rankings for the same model pair, indicating they measure distinct capabilities. Finally, standard Python post-processing silently breaks code evaluation for non-autoregressive generators, a trap that undermines many published results. These findings apply broadly to any multi-stage or non-autoregressive generation pipeline, urging more diagnostic evaluation practices.
- SpecRef combines AR draft with masked diffusion using entropy-guided masking—no training required
- Providing a syntactic scaffold boosts code accuracy from near 0% to over 20%, exposing structural vs logical failures
- Refinement tension shows multi-step correction harms correct tokens; log-likelihood vs generative eval rank models differently
Why It Matters
For AI engineers: this paper exposes hidden evaluation flaws and offers a free hybrid decode that fixes structural errors.