Developer Tools

Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

Simple feedback loops beat complex pipelines for 1-3B code generation models.

Deep Dive

A new study from Charles Junichi McAndrews, published on arXiv, challenges the assumption that complex multi-model pipelines are necessary for improving code generation with small language models (1-3B parameters). The paper, titled 'Feedback Over Form,' systematically compares various pipeline topologies—including generate-execute-refine loops and NEAT-inspired evolutionary structures—against simpler refinement methods. Running all experiments locally on a single laptop using HumanEval (164 problems) and sanitized MBPP (427 problems), the author found that self-refinement with execution feedback improved code generation by more than 4 standard deviations on both benchmarks.

The gains came from a narrow mechanism: execution feedback primarily fixed runtime errors such as NameError and SyntaxError but rarely addressed logic errors surfacing as AssertionError. Surprisingly, the identity of the generator model mattered less than the refiner's capability: a 1.5B generator paired with a 3B refiner matched a single 3B model handling both roles. Early stopping was critical; without it, every additional refinement iteration was net-negative. Code-specialized models outperformed every general-purpose pipeline configuration, suggesting that model specialization matters more than pipeline architecture. Preliminary text-only pipeline experiments without execution feedback showed no gains at this scale. The evolutionary search mostly rediscovered the simple generate-execute-refine loop, with no significant benefit from added topology, and single-evaluation fitness inflated results by 5-7% by selecting lucky genomes over robust ones. The key takeaway: for 1-3B models, execution feedback matters far more than pipeline complexity.
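The single-evaluation fitness pitfall is easy to illustrate. In the toy sketch below (my assumption, not the paper's code), a genome's "true" pass rate drives a noisy pass/fail score: one draw can crown a lucky genome, while averaging several draws favors the robustly better one.

```python
import random

def single_eval(pass_rate: float, rng: random.Random) -> float:
    """One noisy benchmark run: a stand-in for scoring a pipeline genome once."""
    return 1.0 if rng.random() < pass_rate else 0.0

def averaged_fitness(pass_rate: float, rng: random.Random, k: int = 200) -> float:
    """Average k independent runs so robustly good genomes outrank lucky ones."""
    return sum(single_eval(pass_rate, rng) for _ in range(k)) / k
```

With a hypothetical robust genome (true pass rate 0.8) and a lucky one (0.5), a single draw scores each as 0 or 1 and can rank them either way, whereas the multi-run average reliably puts the robust genome first. That gap is the flavor of inflation the study reports for single-evaluation selection.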

Key Points
  • Self-refinement with execution feedback improved code generation by over 4 standard deviations on HumanEval and MBPP.
  • A 1.5B generator paired with a 3B refiner matched a single 3B model doing both roles.
  • Code-specialized models outperformed all general-purpose pipeline configurations tested.

Why It Matters

Simplifies local code generation with small models: focus on feedback, not complex pipelines, for practical gains.