Graph-Based Belief Propagation Cuts Token Use 97% for Multi-LLM Aggregation

arXiv cs.GT June 02, 2026

⚡ Combine expert LLMs without extra inference calls, cutting aggregation time from minutes to milliseconds.

Deep Dive

A new arXiv paper introduces a different approach to combining specialized Large Language Models (LLMs). Current ensemble methods such as iterative re-prompting and cross-model refinement are slow and computationally expensive because they rely on repeated LLM calls. They can also lose accuracy when weaker models contaminate stronger ones, a failure mode termed anchor corruption. The authors instead represent each LLM as a variable node in a bipartite factor graph, with check nodes that assess consistency across diverse epistemic criteria. A message-passing protocol inspired by error-recovery systems resolves disagreements, while an asymmetric damping mechanism protects high-reliability anchor nodes from being overridden by majority noise. Crucially, the framework operates only on the models' output distributions, so refinement requires zero additional LLM calls.
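To make the idea of asymmetric damping concrete, here is a minimal sketch in plain Python. It is not the paper's protocol: the averaging rule at the check node, the step sizes, the model names, and the prior distributions are all assumptions chosen for illustration. Each model holds a distribution over four answer options, a single check node sends every model the mean belief of its peers, and each model moves toward that message by its own step size. A small step for the anchor is what stands in for asymmetric damping here.

```python
def normalize(p):
    """Scale a list of non-negative scores so it sums to 1."""
    total = sum(p)
    return [x / total for x in p]


def check_message(beliefs, members, target):
    """Check-to-variable message (assumed rule): mean belief of the other members."""
    others = [beliefs[m] for m in members if m != target]
    return [sum(col) / len(others) for col in zip(*others)]


def propagate(priors, checks, step, rounds=10):
    """Damped message passing on a bipartite graph of models and check nodes.

    priors: model name -> output distribution over answer options
    checks: list of check nodes, each a list of the model names it connects
    step:   model name -> step size in [0, 1]; a small value means heavy damping
    No LLM is called here: only the distributions are updated.
    """
    beliefs = {name: normalize(p) for name, p in priors.items()}
    for _ in range(rounds):
        updated = {}
        for name, old in beliefs.items():
            msgs = [check_message(beliefs, members, name)
                    for members in checks
                    if name in members and len(members) > 1]
            if not msgs:
                updated[name] = old
                continue
            # Combine incoming check messages by averaging (an assumption).
            proposal = [sum(col) / len(msgs) for col in zip(*msgs)]
            a = step[name]
            # Convex blend keeps the result a valid distribution.
            updated[name] = [(1 - a) * o + a * p for o, p in zip(old, proposal)]
        beliefs = updated  # synchronous update
    return beliefs


def top_choice(p):
    """Index of the most probable answer option."""
    return max(range(len(p)), key=p.__getitem__)


# Hypothetical ensemble: one strong anchor favoring option 0, two weaker
# models favoring option 1.
priors = {
    "expert": [0.70, 0.10, 0.10, 0.10],
    "weak_1": [0.20, 0.60, 0.10, 0.10],
    "weak_2": [0.25, 0.55, 0.10, 0.10],
}
checks = [["expert", "weak_1", "weak_2"]]

asymmetric = propagate(priors, checks,
                       step={"expert": 0.05, "weak_1": 0.5, "weak_2": 0.5})
symmetric = propagate(priors, checks,
                      step={"expert": 0.5, "weak_1": 0.5, "weak_2": 0.5})
```

In this toy setup, the heavily damped anchor keeps option 0 and the weaker models move toward it, whereas with equal step sizes the two-model majority pulls the anchor over to option 1. That second outcome is a small-scale picture of the anchor corruption the article describes.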

Tested on four benchmarks (MMLU, MMLU-Pro, GPQA, and MedMCQA), the method achieves a 97% reduction in token usage and up to a 6x decrease in API calls compared to iterative re-prompting methods. Inference time drops from several minutes to milliseconds, and the method consistently outperforms leading multi-agent baselines. The results suggest that graph-based belief propagation offers a robust, fast, and scalable alternative to current multi-LLM systems. The full pipeline and code will be made public, which could enable real-time deployment of diverse expert models without costly re-prompting overhead.
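As a rough illustration of what those ratios could mean in practice, the sketch below applies them to a made-up baseline. The baseline figures (200,000 tokens and 30 API calls per query batch) and the function name are hypothetical and do not come from the paper. The 6x factor is the reported upper bound, so the call count shown is a best case.

```python
def projected_cost(baseline_tokens, baseline_calls,
                   token_reduction_pct=97, call_factor=6):
    """Apply the reported reductions to a hypothetical baseline.

    Integer arithmetic is used so the result is exact.
    """
    tokens = baseline_tokens * (100 - token_reduction_pct) // 100
    calls = -(-baseline_calls // call_factor)  # ceiling division
    return tokens, calls


# Hypothetical iterative re-prompting baseline (assumed numbers).
tokens, calls = projected_cost(baseline_tokens=200_000, baseline_calls=30)
print(tokens, calls)  # 6000 5
```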

Key Points

97% reduction in token usage and up to 6x decrease in API calls compared to iterative re-prompting methods.
Inference time reduced from several minutes to milliseconds on MMLU, MMLU-Pro, GPQA, and MedMCQA benchmarks.
Asymmetric damping mechanism prevents anchor corruption, protecting high-accuracy models from weaker ensemble members.

Why It Matters

Makes multi-LLM ensembles practical for real-time applications, slashing cost and latency dramatically.
