Study: Multi-agent LLM code generation architectures add complexity without accuracy gains
Elaborate multi-agent LLM setups can produce code that is up to 130% more complex, with no accuracy benefit.
A new study from Nazmus Ashrafi (arXiv:2606.00308) systematically examines how multi-agent LLM code generation architectures affect code complexity, not just functional correctness. It compares six configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) across all 164 HumanEval tasks using two GPT-4o family models, for 6 × 164 × 2 = 1,968 paired observations. Each completion is scored on five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort), and the scores are analyzed with a non-parametric statistical pipeline.
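To make the design and the three Halstead metrics concrete, here is a minimal sketch using the standard textbook Halstead definitions. The function name `halstead` and the token counts in the usage note are hypothetical; the study itself obtains these metrics from RADON, and this sketch does not reproduce its pipeline.

```python
import math

# Study design as stated above: 6 architectures x 164 tasks x 2 models.
ARCHITECTURES = ["Basic", "AC", "ACT", "Debugger", "AC+Debugger", "ACT+Debugger"]
N_TASKS = 164
N_MODELS = 2
N_OBSERVATIONS = len(ARCHITECTURES) * N_TASKS * N_MODELS  # 1968


def halstead(n1, n2, N1, N2):
    """Standard Halstead metrics from token counts (hypothetical helper).

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total number of operators and total number of operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    # Volume: program length scaled by the bits needed per vocabulary item.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty: grows with operator variety and operand reuse.
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    # Effort: difficulty times volume.
    effort = difficulty * volume
    return {"volume": volume, "difficulty": difficulty, "effort": effort}
```

For example, `halstead(2, 2, 4, 4)` gives a volume of 16.0, a difficulty of 2.0, and an effort of 32.0. Because Effort multiplies the other two metrics, it tends to amplify differences between architectures more than Volume or Difficulty alone.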
The results reveal that the six architectures collapse into two complexity clusters: architectures within a cluster are indistinguishable from one another, while the clusters themselves are separated by a 50-130% gap. The pattern holds across both models and whether analyzing all completions or only passing ones. The analyst-coder split is the primary driver of complexity inflation; the runtime debugger actually deflates complexity on an analyst-coder background, while the tester re-inflates it. Most importantly, the heavy cluster's additional complexity yields no pass@1 advantage: the leanest architectures (Basic, AC) match or beat the heaviest (ACT+Debugger) on accuracy. The paper concludes that architectural elaboration should be justified by measured benefit, not assumed.
- Six multi-agent LLM architectures collapse into two complexity clusters with a 50-130% gap, consistent across both GPT-4o family models and across all-completion and passing-only analyses.
- Analyst-coder split inflates code complexity; on an analyst-coder background, the runtime debugger deflates it and the tester re-inflates it.
- Leanest architectures (Basic, AC) match or beat heaviest (ACT+Debugger) on pass@1 accuracy despite producing simpler code.
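The two quantities behind these takeaways, a relative complexity gap and pass@1, can be sketched as follows. This is an illustration under stated assumptions: the gap is computed here between medians, which the summary does not confirm as the study's choice, pass@1 is taken as the fraction of tasks whose single completion passes, and the function names and sample values are hypothetical rather than taken from the paper.

```python
from statistics import median


def relative_gap_pct(lean, heavy):
    """Percent by which the heavy group's median exceeds the lean group's.

    Assumption: medians are used as the summary statistic.
    """
    lean_median = median(lean)
    return 100.0 * (median(heavy) - lean_median) / lean_median


def pass_at_1(results):
    """Fraction of tasks passed, assuming one completion per task."""
    return sum(results) / len(results)


# Hypothetical per-task SLOC values, NOT data from the study.
lean_sloc = [10, 12, 14]
heavy_sloc = [20, 24, 30]
gap = relative_gap_pct(lean_sloc, heavy_sloc)

# Hypothetical pass/fail outcomes for four tasks.
accuracy = pass_at_1([True, True, False, True])
```

With these made-up inputs, `gap` is 100.0 (the heavy median is double the lean median) and `accuracy` is 0.75. The study's finding amounts to a large value for the first quantity alongside no improvement in the second.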
Why It Matters
Simpler multi-agent designs can match the accuracy of more elaborate ones while producing less complex code, which may reduce compute and maintenance costs.