DiffCodeGen beats code gen benchmarks with zero extra LLM calls
Cuts token usage by using behavior clustering instead of LLM voting...
A new paper from Yifeng He and colleagues presents DiffCodeGen, a differential test-time scaling technique that improves code generation without the LLM overhead that candidate selection usually adds. The method first produces a diverse set of code candidates using varied sampling and prompting strategies. Instead of relying on public test cases or expensive LLM-based voting, DiffCodeGen automatically synthesizes test inputs via coverage-guided fuzzing. It runs every candidate on these inputs, records each one's execution behavior, and clusters the candidates by behavioral similarity. The final answer is the medoid of the largest cluster, that is, the candidate whose behavior is closest to the rest of that cluster. No additional LLM calls are needed.
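The selection step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The input list is hand-written where DiffCodeGen would use coverage-guided fuzzing, and the Hamming distance, the greedy threshold clustering, and every function name (`behavior_signature`, `hamming`, `select_candidate`) are hypothetical choices made only for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Signature = Tuple[Tuple[str, str], ...]


def behavior_signature(func: Callable[[Any], Any], inputs: Sequence[Any]) -> Signature:
    """Run one candidate on every input and record its output or exception type."""
    observed = []
    for x in inputs:
        try:
            observed.append(("ok", repr(func(x))))
        except Exception as exc:  # a crash is also observable behavior
            observed.append(("err", type(exc).__name__))
    return tuple(observed)


def hamming(a: Signature, b: Signature) -> int:
    """Count the inputs on which two candidates behave differently (assumed metric)."""
    return sum(u != v for u, v in zip(a, b))


def select_candidate(
    candidates: Sequence[Callable[[Any], Any]],
    inputs: Sequence[Any],
    threshold: int = 0,
) -> int:
    """Return the index of the medoid of the largest behavior cluster."""
    sigs = [behavior_signature(c, inputs) for c in candidates]

    # Greedy clustering (an assumption): a candidate joins the first cluster
    # whose first member is within `threshold` of it, else starts a new cluster.
    clusters: List[List[int]] = []
    for i, sig in enumerate(sigs):
        for cluster in clusters:
            if hamming(sigs[cluster[0]], sig) <= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    largest = max(clusters, key=len)

    # Medoid: the member with the smallest total distance to its cluster mates.
    return min(largest, key=lambda i: sum(hamming(sigs[i], sigs[j]) for j in largest))


# Four hypothetical candidates for "absolute value"; the third one is buggy.
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,  # wrong for negative inputs
    lambda x: max(x, -x),
]
inputs = [-3, 0, 5]  # stand-in for fuzzer-generated inputs

chosen = select_candidate(candidates, inputs)
print("selected candidate index:", chosen)
```

In this toy run the three correct candidates agree on every input and form the largest cluster, so one of them is selected without any model call. A real pipeline would need sandboxed execution and timeouts, which the sketch omits.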
Evaluated across four different large language models, DiffCodeGen consistently outperforms standard baselines and matches or surpasses state-of-the-art test-time scaling methods. The key advantage is efficiency: because selection happens purely through runtime behavior analysis, it consumes only a fraction of the time and tokens those methods require. The method is fully asynchronous and model-agnostic, which makes it a good fit for agentic coding pipelines. It can also be combined with reasoning models to push performance further. Together, these properties point toward more scalable and cost-effective code generation in production.
- DiffCodeGen uses coverage-guided fuzzing to generate test inputs, so it needs neither existing tests nor LLM calls for selection.
- Code candidates are clustered by execution behavior; the medoid of the largest cluster is selected.
- Across four LLMs, it matches or beats prior methods while using significantly fewer tokens and less wall-clock time.
Why It Matters
Cuts code gen costs by eliminating expensive LLM voting, making test-time scaling practical for real-world use.