Evaluating Multi-Agent LLM Architectures for Rare Disease Diagnosis
A new study of 302 rare disease cases finds that adding more AI agents doesn't always improve diagnostic reasoning.
Published on arXiv by researcher Ahmed Almasoud, the study rigorously tests whether using multiple AI agents in concert improves diagnostic accuracy for rare diseases. It evaluated four distinct agent topologies (a single Control agent, a Hierarchical system, a Collaborative group, and an Adversarial setup) across 302 complex medical cases spanning 33 rare disease categories. The key finding was that a Hierarchical multi-agent architecture achieved the highest accuracy at 50.0%, only marginally ahead of a Collaborative system (49.8%) and a single, robust LLM agent (48.5%).
However, the study delivered a crucial counterintuitive insight: increasing system complexity does not guarantee better performance. The Adversarial model, in which agents critique each other's reasoning, catastrophically degraded accuracy to just 27.3%. The researchers introduced a 'Reasoning Gap' metric, which revealed that this setup often surfaced the correct diagnosis during internal deliberation but then talked itself out of it, rejecting valid final answers. Performance also varied significantly by disease category, with multi-agent systems excelling in Bone and Thoracic diseases but struggling with Cardiac and Respiratory cases.
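The article does not reproduce the paper's exact formula, but a metric like this can be sketched as the fraction of cases where the correct diagnosis appeared somewhere in the system's internal reasoning yet was missing from the final answer. The data shape below (`truth`, `mentioned`, `final` fields) is an assumption for illustration, not the paper's actual schema:

```python
def reasoning_gap(cases):
    """Estimate a 'Reasoning Gap': of the cases where the correct
    diagnosis was raised during internal deliberation, what fraction
    did the system ultimately reject?

    Each case is a dict with:
      'truth'     - ground-truth diagnosis
      'mentioned' - set of diagnoses raised in internal reasoning
      'final'     - the diagnosis the system committed to
    """
    # Cases where the correct answer appeared internally at some point.
    knew = [c for c in cases if c["truth"] in c["mentioned"]]
    if not knew:
        return 0.0
    # Of those, cases where the final answer was still wrong.
    rejected = [c for c in knew if c["final"] != c["truth"]]
    return len(rejected) / len(knew)

cases = [
    {"truth": "A", "mentioned": {"A", "B"}, "final": "B"},  # knew it, rejected it
    {"truth": "C", "mentioned": {"C"},      "final": "C"},  # knew it, kept it
    {"truth": "D", "mentioned": {"E"},      "final": "E"},  # never surfaced it
]
print(reasoning_gap(cases))  # 0.5: one of the two "knew" cases was rejected
```

A high value on a metric like this is what distinguishes "the system lacks the knowledge" from "the system doubts itself out of a correct answer," which is the failure mode the study attributes to the Adversarial setup.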
These results have significant implications for how developers build AI for high-stakes fields like medicine. The paper argues against a one-size-fits-all approach to multi-agent design, instead advocating for dynamic, context-aware topology selection. The work provides a quantitative framework for evaluating not just if an AI system is accurate, but how its internal reasoning architecture influences that accuracy, paving the way for more reliable and transparent diagnostic tools.
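In its simplest form, the "context-aware topology selection" the paper advocates is a routing decision made before any agents run. The mapping below is a toy illustration loosely based on the reported category results (multi-agent gains in Bone and Thoracic cases); a real system would learn this mapping from validation data rather than hard-code it:

```python
# Illustrative router: choose an agent topology per disease category.
# The category-to-topology table is a hypothetical example, not the
# paper's actual policy.
PREFERRED_TOPOLOGY = {
    "bone": "hierarchical",
    "thoracic": "hierarchical",
    "cardiac": "single",
    "respiratory": "single",
}

def select_topology(category: str, default: str = "single") -> str:
    """Route a case to a topology; fall back to a single agent when
    no category-specific evidence favors a multi-agent setup."""
    return PREFERRED_TOPOLOGY.get(category.lower(), default)

print(select_topology("Bone"))    # hierarchical
print(select_topology("renal"))   # single (unknown category, default)
```

Defaulting to a single agent reflects the study's central caution: added coordination only pays off in categories where it has demonstrated an advantage.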
- Hierarchical multi-agent LLMs achieved top accuracy of 50.0% across 302 rare disease cases, just 1.5 percentage points above a single agent.
- Adversarial agent architecture crashed performance to 27.3%, revealing a major 'Reasoning Gap' where correct diagnoses were rejected.
- Performance varied wildly by disease type, with multi-agent systems showing clear superiority only in Bone and Thoracic categories.
Why It Matters
For builders of medical AI, this shows that smarter system design, not just more agents, is critical for reliable, high-stakes applications.