DeepSeek-R1 pairs crush multi-agent coding tests, LLaMA 3.2 and Qwen3 show role alignment
In a controlled test of 12 LLM pairs on a simple Fibonacci coding task, only DeepSeek-R1 teams converged correctly from the start—every other open-source model either showed role alignment but diverged or never converged at all, revealing a deep gap between model capability and multi-agent reliability.
The promise of autonomous software engineering rests on multiple agents—a Designer to plan, a Programmer to code—collaborating seamlessly. A recent study by researchers at Chalmers University of Technology and Ericsson put this vision to the test, evaluating 12 pairs drawn from 7 open-source models on a coding game that required generating Fibonacci numbers. The results were stark: only pairs using DeepSeek-R1 sustained correct convergence from the first iteration. Models like LLaMA 3.2 and Qwen3 exhibited what the researchers call "role alignment"—they understood their designated roles—but still diverged after a few steps. The majority of model pairs never converged at all, underscoring a persistent fragility in multi-agent coordination.
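To make the three outcomes concrete, here is a minimal Python sketch of a Designer and Programmer feedback loop on a Fibonacci task. It is an illustration under stated assumptions, not the study's actual harness: the agents are plain stub functions rather than LLM calls, and every name (`run_pair`, `classify`, `make_drifting_programmer`), the feedback format, the five-iteration limit, and the "converged late" label are hypothetical.

```python
from typing import Callable, List


def fib_reference(n: int) -> List[int]:
    """Return the first n Fibonacci numbers (0, 1, 1, 2, ...) as ground truth."""
    seq: List[int] = []
    a, b = 0, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq


def run_pair(
    designer: Callable[[str], str],
    programmer: Callable[[str, int], List[int]],
    n: int = 10,
    max_iters: int = 5,
) -> List[bool]:
    """Run the loop and record whether each iteration's output was correct."""
    target = fib_reference(n)
    feedback = ""
    history: List[bool] = []
    for _ in range(max_iters):
        plan = designer(feedback)       # Designer turns feedback into a plan
        output = programmer(plan, n)    # Programmer turns the plan into output
        ok = output == target
        history.append(ok)
        feedback = "correct" if ok else f"expected {target}, got {output}"
    return history


def classify(history: List[bool]) -> str:
    """Map a correctness history to an outcome label (labels are assumptions)."""
    if not any(history):
        return "never converged"
    if all(history):
        return "converged from start"
    if history[-1]:
        return "converged late"
    return "diverged"


# --- Stub agents standing in for LLMs (purely illustrative) ---
def stable_designer(feedback: str) -> str:
    return "generate the Fibonacci sequence"


def stable_programmer(plan: str, n: int) -> List[int]:
    return fib_reference(n)


def make_drifting_programmer(good_calls: int) -> Callable[[str, int], List[int]]:
    """Build a programmer that is correct for `good_calls` calls, then drifts."""
    calls = {"count": 0}

    def programmer(plan: str, n: int) -> List[int]:
        calls["count"] += 1
        seq = fib_reference(n)
        if calls["count"] > good_calls:
            seq[-1] += 1  # simulated drift: corrupt the last value
        return seq

    return programmer


if __name__ == "__main__":
    print(classify(run_pair(stable_designer, stable_programmer)))
    print(classify(run_pair(stable_designer, make_drifting_programmer(2))))
    print(classify(run_pair(stable_designer, make_drifting_programmer(0))))
```

In this toy setup, a pair that is right on the first iteration and stays right is counted as converging from the start, a pair that is right early but wrong at the end is counted as diverging, and a pair that is never right is counted as never converging. The real study's criteria may differ.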
This finding lands at a moment when the market for AI-powered software engineering tools is projected to exceed $30 billion by 2028. Startups like Cognition AI, which raised $175 million at a $2 billion valuation for its agent Devin, are betting that multi-agent workflows can automate complex coding tasks. Frameworks such as Microsoft's AutoGen and earlier systems like ChatDev (2023) and MetaGPT (2023) have popularized the concept, yet the current study shows that even relatively recent open-source models struggle with basic convergence on a narrow task. If leading open-source models cannot reliably coordinate on a simple Fibonacci game, the path to production-level autonomous coding is likely far longer than many assume.
The deeper implication is not that open-source models are broken, but that multi-agent coordination imposes fundamentally different requirements than single-agent performance. DeepSeek-R1's success points to the importance of reasoning chains that maintain consistency across iterative feedback loops—a feature that reasoning-optimized models are beginning to master. However, the study also carries significant caveats: it tested only one specific architecture (Designer+Programmer), a simple task, and a limited set of open-source models. Proprietary models like GPT-4o, Claude 3.5, and Gemini were absent, so the results do not generalize to the entire landscape. Additionally, the paper did not explore variations in prompts, feedback mechanisms, or memory augmentation—factors that could substantially improve convergence.
For enterprises evaluating multi-agent coding tools, the study's message is cautionary: on this evidence, multi-agent coordination among open-source models is not yet mature enough for production. The divergence problem is a real blocker, not a minor bug. Until models can reliably sustain correct coordination over multiple iterations, even on trivial tasks, relying on them for complex, multi-step software engineering remains a gamble. The burden is on model developers and framework designers to demonstrate convergence at scale, not just single-agent prowess.
- Pairs using DeepSeek-R1 were the only ones in the study to sustain correct convergence from the first iteration on a simple multi-agent coding task, highlighting the value of reasoning-optimized models.
- The majority of open-source model pairs fail to converge, suggesting that multi-agent coordination remains a weak point even as single-agent performance improves.
- The $30 billion market for AI software engineering tools may face adoption delays unless models can demonstrate consistent multi-agent reliability beyond simple benchmarks.
Why It Matters
Multi-agent coordination is the next frontier for autonomous coding, but current open-source models show it remains fragile.