Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
New multimodal routing protocol for AI agents improves task accuracy by 20 percentage points over text-only systems.
Researcher Vasundra Srinivasan has published a paper introducing MMA2A (Multimodal A2A), an extension of the Agent-to-Agent (A2A) protocol that fundamentally changes how AI agents communicate. Current multi-agent systems typically convert all inputs (images, voice, text) into text before routing between agents, creating a "text bottleneck" that discards crucial multimodal context. MMA2A instead inspects "Agent Card" capability declarations and routes voice, image, and text data in their native modalities, preserving richer contextual information for downstream reasoning agents.
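The summary does not reproduce the protocol's wire format, but the core routing idea is simple enough to sketch. In the hypothetical Python below, `AgentCard`, `Message`, and `route_message` are illustrative names, not MMA2A's actual API: the router checks each agent's declared modalities and hands off the native payload when it can, collapsing to a lossy text fallback only when no capable agent exists.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical sketch of modality-native routing. AgentCard, Message, and
# route_message are illustrative names, not MMA2A's actual API.

@dataclass
class AgentCard:
    name: str
    # Modalities this agent declares it can consume natively.
    modalities: set[str] = field(default_factory=lambda: {"text"})

@dataclass
class Message:
    modality: str         # "text" | "image" | "voice"
    payload: bytes | str  # raw native data
    text_fallback: str    # lossy text description, used only when forced

def route_message(msg: Message, agents: list[AgentCard]) -> tuple[AgentCard, Message]:
    """Prefer an agent that declares the message's native modality;
    otherwise collapse to the text fallback (the 'text bottleneck')."""
    for agent in agents:
        if msg.modality in agent.modalities:
            return agent, msg  # native hand-off, context preserved
    # No capable agent found: degrade to text and route to a text agent.
    degraded = Message("text", msg.text_fallback, msg.text_fallback)
    text_agent = next(a for a in agents if "text" in a.modalities)
    return text_agent, degraded

# Usage: an image-based defect report reaches the vision agent untranscribed.
agents = [AgentCard("triage", {"text"}), AgentCard("vision-qa", {"text", "image"})]
report = Message("image", b"<jpeg bytes>", "photo of a cracked housing")
target, delivered = route_message(report, agents)
print(target.name, delivered.modality)  # -> vision-qa image
```

The design point the paper emphasizes is exactly this branch: a text-bottleneck system always takes the degraded path, regardless of what downstream agents could actually handle.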
On the controlled 50-task CrossModal-CS benchmark, MMA2A demonstrated substantial performance gains. With the same underlying LLM backend and identical tasks, multimodal routing achieved 52% task-completion accuracy versus just 32% for the text-bottleneck baseline, a statistically significant improvement of 20 percentage points. The research also establishes a crucial two-layer requirement: the protocol-level routing innovation delivers benefits only when paired with capable, LLM-backed agent-level reasoning. When the LLM backend was replaced with simple keyword matching, the accuracy gap disappeared entirely (36% vs. 36%).
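The ablation result is intuitive when phrased in terms of the sketch above: a keyword-matching agent can only ever consume the text fallback, so routing the native payload to it adds no usable information. The toy stand-ins below are hypothetical (they reuse `Message` from the earlier sketch and are not the paper's agents) but make the dependency concrete.

```python
def keyword_backend(msg: Message) -> str:
    # Ablation stand-in: surface keyword matching can only read the text
    # fallback, so a natively routed image carries no extra signal here.
    return "defect" if "crack" in msg.text_fallback.lower() else "ok"

def llm_backend(msg: Message) -> str:
    # Stand-in for an LLM-backed vision agent: a real system would send
    # msg.payload (raw image/audio bytes) to a multimodal model. Only at
    # this layer does modality-native routing translate into accuracy.
    if msg.modality == "image":
        return "defect"  # mocked model output for the sake of the sketch
    return keyword_backend(msg)
```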
The performance improvements concentrate heavily in vision-dependent applications. Accuracy on product defect reports improved by 38.5 percentage points, while visual troubleshooting tasks saw 16.7-point gains. These gains come with a computational tradeoff, however: native multimodal processing incurs 1.8× the latency of text-only routing. The research positions routing as a first-order design variable in multi-agent systems, since it determines what information downstream agents can actually reason with.
Key Points
- MMA2A improves multi-agent task accuracy by 20 percentage points (52% vs. 32%) on the CrossModal-CS benchmark
- Vision-dependent tasks benefit most: product defect reports gain 38.5 points and visual troubleshooting 16.7 points
- Gains require capable LLM reasoning; replacing it with keyword matching eliminates the accuracy advantage entirely (36% vs. 36%)
Why It Matters
Enables more accurate AI agent teams for visual troubleshooting, quality inspection, and multimodal customer service applications.