MASEval: Extending Multi-Agent Evaluation from Models to Systems
New research shows your LangGraph vs. AutoGen decision matters just as much as picking GPT-4 vs. Claude 3.
A team of researchers including Cornelius Emde and Alexander Rubinstein has published MASEval, a new evaluation framework that addresses a critical gap in AI development. While the rapid adoption of LLM-based agentic systems has created a rich ecosystem of frameworks such as smolagents, LangGraph, and AutoGen, existing benchmarks remain model-centric: they typically fix the agentic setup and leave other crucial system components uncompared. The researchers argue that implementation decisions, including system topology, orchestration logic, and error handling, substantially affect real-world performance, a factor previously overlooked in standardized testing.
MASEval treats the entire multi-agent system as the unit of analysis, providing a framework-agnostic library for systematic, component-level evaluation. In their initial study, the team ran a comparison across 3 benchmarks, 3 foundation models, and 3 popular agent frameworks. Their key finding is that the choice of framework matters as much as the choice of the underlying model in determining system performance, shifting the evaluation paradigm from judging models in isolation to holistically assessing the engineered system built around them.
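To make the idea concrete, the sketch below shows what a system-level evaluation grid of this kind could look like. It is a minimal illustration, not MASEval's actual API: the `run_system` stub and the benchmark, model, and framework names are hypothetical placeholders for wherever framework-specific adapters and scoring would live.

```python
import itertools
from statistics import mean

# Hypothetical axes of the evaluation grid (illustrative names only).
BENCHMARKS = ["benchmark_a", "benchmark_b", "benchmark_c"]
MODELS = ["model_x", "model_y", "model_z"]
FRAMEWORKS = ["smolagents", "langgraph", "autogen"]

def run_system(benchmark: str, model: str, framework: str) -> float:
    """Stub: build the multi-agent system with the given framework and
    model, run it on the benchmark, and return a task-success score.
    A real harness would dispatch to a framework-specific adapter here."""
    return 0.0  # placeholder score

# Sweep the full grid so the assembled system, not the model alone,
# is the unit of analysis.
results = {
    combo: run_system(*combo)
    for combo in itertools.product(BENCHMARKS, MODELS, FRAMEWORKS)
}

# Marginalize over benchmarks and models to compare frameworks head to head.
for framework in FRAMEWORKS:
    scores = [s for (_, _, f), s in results.items() if f == framework]
    print(f"{framework}: mean score {mean(scores):.3f}")
```

Holding two axes fixed while varying the third is what lets a finding like "framework choice matters as much as model choice" fall out of the data.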
The framework, released under an MIT license, opens new avenues for both research and practice. For researchers, it enables principled exploration of every component of an agentic system, moving beyond model capabilities to system architecture. For practitioners and developers, MASEval offers a way to empirically identify the best combination of model, framework, and system design for a given task, grounding selection in data rather than intuition and hype.
- MASEval is a new framework-agnostic evaluation library that assesses entire multi-agent AI systems, not just the underlying LLMs.
- A study using MASEval across 3 benchmarks, 3 models, and 3 frameworks found that framework choice impacts performance as much as model choice.
- The tool allows developers to make data-driven decisions when choosing between systems built with LangGraph, AutoGen, smolagents, and other frameworks.
Why It Matters
Enables data-driven system design, helping teams choose the right agent framework and architecture, not just the best model.