Review Beats Planning: Dual-Model Interaction Patterns for Code Synthesis
A simple role reversal between two AI models beats GPT-4o and O1 Preview for under $2/hour.
A new research paper by Jan Miller, 'Review Beats Planning: Dual-Model Interaction Patterns for Code Synthesis,' challenges the conventional wisdom of how to chain AI models for coding tasks. The standard 'plan-then-code' approach, where a reasoning model creates a plan for a code specialist to implement, was found to degrade performance by 2.4 percentage points versus using the coder alone. Miller's team discovered that reversing this interaction—letting the code specialist generate a solution first, then having the reasoning model review and refine it—yields dramatically better results. This 'code-then-review' pattern achieved a state-of-the-art 90.2% pass@1 accuracy on the HumanEval+ benchmark, surpassing both OpenAI's GPT-4o (87.2%) and the O1 Preview (89.0%), all while running on roughly $2 per hour of commodity GPU hardware.
The research, validated across 542 problems on HumanEval+ and MBPP+, identified a key moderating variable: the richness of the problem specification. The review-based improvement scales with specification quality, offering a 4x greater boost on richly-specified problems (+9.8 percentage points) compared to lean ones (+2.3 pp), though it remains beneficial in both cases. This finding has two major practical implications for AI developers and enterprises. First, it provides a clear, cost-effective blueprint for composing specialized models by leveraging their innate cognitive strengths. Second, it quantifies the return on investment for creating high-quality specifications, showing that better input documentation directly amplifies the performance gains from this review-based AI workflow.
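The 'code-then-review' pattern described above can be sketched as a simple two-stage pipeline. The model calls below are hypothetical stand-ins (stub functions simulating a code specialist and a reasoning reviewer), not the paper's actual models or API; the sketch only illustrates the order of interaction Miller's team found effective.

```python
# Minimal sketch of the 'code-then-review' interaction pattern.
# Both "models" are hypothetical stubs standing in for real LLM calls.

def code_specialist(spec: str) -> str:
    """Stand-in for the code-specialist model: drafts a solution first."""
    # Simulated draft containing a deliberate off-by-one bug,
    # so the review stage has something to catch.
    return "def add(a, b):\n    return a + b + 1"

def reasoning_reviewer(spec: str, draft: str) -> str:
    """Stand-in for the reasoning model: reviews and refines the draft."""
    # Simulated review: spot and remove the spurious '+ 1'.
    if " + 1" in draft:
        return draft.replace(" + 1", "")
    return draft

def code_then_review(spec: str) -> str:
    draft = code_specialist(spec)           # stage 1: specialist generates
    return reasoning_reviewer(spec, draft)  # stage 2: reasoner reviews/refines

# Run the pipeline on a toy specification and execute the refined code.
solution = code_then_review("Return the sum of two integers.")
namespace = {}
exec(solution, namespace)
print(namespace["add"](2, 3))  # 5
```

The key design point, per the paper, is the ordering: the specialist commits to concrete code before the reasoning model weighs in, rather than the reasoner planning up front.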
- The 'code-then-review' method achieved 90.2% pass@1 on HumanEval+, beating GPT-4o (87.2%) and O1 Preview (89.0%).
- The approach runs on ~$2/hr of commodity GPU hardware, showing that top-tier performance doesn't require proprietary, expensive models.
- Review effectiveness scales 4x more on rich specifications (+9.8pp gain) than lean ones (+2.3pp), highlighting the value of good docs.
Why It Matters
Provides a cheap, composable blueprint for beating top-tier coding AIs, emphasizing that model interaction design is as crucial as the models themselves.