Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking
LLMs' top-ranked API choices overlap only about half the time, a new study finds
A new study from the University of Washington, presented at the AAAI 2026 LAMAS Workshop, systematically measures how different large language models (LLMs) disagree when deciding which APIs to call for a given task. Lead author Eyhab Al-Masri tested five major model families across 15 canonical API domains, using Average Overlap (AO), Jaccard similarity, Rank-Biased Overlap, Kendall's tau, and Kendall's W to quantify pairwise and group-level agreement; mean AO came in around 0.50 and Kendall's tau around 0.45. The results show that while models exhibit moderate overall alignment, their agreement varies dramatically by task type: structured tasks like Weather and Speech-to-Text are stable, while open-ended tasks like Sentiment Analysis show far higher divergence and ranking volatility.
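To make the pairwise metrics concrete, here is a minimal sketch computing Average Overlap, Jaccard similarity, and Kendall's tau over two hypothetical top-5 API rankings. The API names and rankings are invented for illustration, and the paper's exact evaluation pipeline may differ; only the metric definitions themselves are standard.

```python
from scipy.stats import kendalltau

def average_overlap(a, b, k=None):
    """Average Overlap: mean fraction of shared items across top-d prefixes."""
    k = k or min(len(a), len(b))
    return sum(len(set(a[:d]) & set(b[:d])) / d for d in range(1, k + 1)) / k

def jaccard(a, b):
    """Jaccard similarity of the two API sets (order ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical top-5 API rankings from two models for a weather task.
model_a = ["openweathermap", "weatherapi", "tomorrow_io", "visualcrossing", "meteostat"]
model_b = ["weatherapi", "openweathermap", "meteostat", "accuweather", "tomorrow_io"]

print(f"Average Overlap: {average_overlap(model_a, model_b):.2f}")
print(f"Jaccard:         {jaccard(model_a, model_b):.2f}")

# Kendall's tau compares ranks of the same items, so restrict to the shared set.
shared = [x for x in model_a if x in model_b]
ranks_a = [model_a.index(x) for x in shared]
ranks_b = [model_b.index(x) for x in shared]
tau, _ = kendalltau(ranks_a, ranks_b)
print(f"Kendall's tau:   {tau:.2f}")
```

Note that the two list-overlap measures answer different questions: Jaccard ignores order entirely, while Average Overlap rewards agreement near the top of the ranking, which is what matters when an agent only ever calls its first choice.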
The study identifies a critical hidden failure mode in multi-agent LLM coordination: apparent agreement on surface-level outputs can mask deep instability in the action-relevant API rankings underneath. In practice, models that produce similar answers may still select different tools, or sequence them differently, for the same task, creating unpredictable behavior in autonomous systems. The authors propose consensus weighting as a reliability-aware orchestration mechanism to improve coordination among heterogeneous LLMs, and warn that current benchmarks fail to capture this hidden divergence, leaving it as a pre-deployment safety risk. The findings motivate diagnostic benchmarks that detect ranking instability early, especially for abstract reasoning tasks where divergence is highest.
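The summary does not spell out the paper's consensus-weighting mechanism, but one plausible reading is a reliability-weighted rank aggregation: each model's votes count in proportion to how much its rankings agree with the rest of the group. The sketch below uses a Borda-style score with Average Overlap based weights; the weighting scheme, function names, and rankings are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def average_overlap(a, b, k=None):
    k = k or min(len(a), len(b))
    return sum(len(set(a[:d]) & set(b[:d])) / d for d in range(1, k + 1)) / k

def consensus_rank(rankings):
    """Borda-style aggregation where each model is weighted by its mean
    Average Overlap with the other models (an assumed reliability proxy)."""
    models = list(rankings)
    weights = {
        m: sum(average_overlap(rankings[m], rankings[o])
               for o in models if o != m) / (len(models) - 1)
        for m in models
    }
    scores = defaultdict(float)
    for m, ranked in rankings.items():
        n = len(ranked)
        for pos, api in enumerate(ranked):
            scores[api] += weights[m] * (n - pos)  # higher rank earns more points
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from three models for the same task.
rankings = {
    "model_a": ["openweathermap", "weatherapi", "tomorrow_io"],
    "model_b": ["weatherapi", "openweathermap", "tomorrow_io"],
    "model_c": ["accuweather", "weatherapi", "openweathermap"],
}
print(consensus_rank(rankings))
```

Under this scheme an outlier model (here, model_c) is down-weighted automatically because its rankings overlap less with the group, so the consensus ordering leans toward the majority view without discarding any model outright.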
- Average Overlap across models is only ~0.50, meaning their ranked API lists share roughly half their entries on average
- Structured tasks (Weather, Speech-to-Text) show high stability, while open-ended tasks (Sentiment Analysis) cause high divergence
- Hidden ranking instability can persist even when surface-level outputs appear consistent; a group-level concordance check such as Kendall's W (sketched below) can surface it
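Kendall's W, the group-level statistic mentioned above, is one way to detect this kind of instability before deployment: values near 1 mean the models rank the candidate APIs consistently, while values near 0 indicate near-random disagreement. A minimal sketch of the standard formula, assuming every model ranks the same candidate set; the example rank matrices are invented for illustration:

```python
def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance for m models ranking n items.
    rank_matrix[j][i] is the rank (1..n) model j assigns to item i."""
    m = len(rank_matrix)            # number of models (judges)
    n = len(rank_matrix[0])         # number of APIs ranked
    totals = [sum(row[i] for row in rank_matrix) for i in range(n)]
    mean = sum(totals) / n          # equals m * (n + 1) / 2
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical: three models ranking the same four candidate APIs (1 = best).
stable   = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 1, 3, 4]]
unstable = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 1, 3]]
print(f"W (stable task):   {kendalls_w(stable):.2f}")    # ~0.91, near 1
print(f"W (unstable task): {kendalls_w(unstable):.2f}")  # ~0.11, near 0
```

A per-task W of this kind could flag domains like Sentiment Analysis, where the study reports high divergence, long before any surface-level output comparison would.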
Why It Matters
This hidden ranking divergence is a safety risk for autonomous multi-agent systems that rely on LLMs for API orchestration.