ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces
New routing system uses self-consistency variance to avoid costly full ensembling on 54.2% of tasks.
A new research paper introduces ACAR (Adaptive Complexity and Attribution Routing), a framework for intelligently routing tasks across multiple large language models to optimize performance and cost. Developed by researcher Ramchand Kumaresan and tested on 1,510 tasks across four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA), ACAR uses self-consistency variance (sigma) computed from just three probe samples to determine whether to use single-model, two-model, or three-model execution. The system runs on TEAMLLM, a deterministic execution substrate that provides complete, auditable decision traces for all 7,550 experimental runs.
The results show ACAR achieved 55.6% accuracy using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, outperforming a two-model baseline (54.4%) while avoiding the computational expense of full three-model ensembling on 54.2% of tasks. The research also documents critical negative findings: retrieval augmentation reduced accuracy by 3.4 percentage points due to poor semantic alignment, and the 'agreement-but-wrong' failure mode—where models confidently agree on incorrect answers—creates an intrinsic accuracy ceiling approximately eight percentage points below full ensembling. These findings establish falsifiable baselines for future work on practical multi-model systems that require both efficiency and auditability.
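The paper does not reproduce ACAR's exact sigma computation or thresholds here, but the routing idea can be sketched as follows: draw three probe samples, measure their disagreement as a variance-like score, and escalate from single-model to two- or three-model execution only when the probes disagree. The function name, the disagreement proxy, and the `low`/`high` thresholds below are illustrative assumptions, not values from the paper.

```python
from collections import Counter

def route_by_variance(probe_answers, low=0.0, high=0.5):
    """Illustrative sketch of sigma-based adaptive routing.

    probe_answers: answers from three probe samples (e.g. ["A", "A", "B"]).
    low / high: hypothetical thresholds, not taken from the ACAR paper.
    Returns 1, 2, or 3 -- the number of models to run on this task.
    """
    counts = Counter(probe_answers)
    top_count = counts.most_common(1)[0][1]
    # Disagreement proxy standing in for sigma: fraction of probes
    # that differ from the majority answer (0.0 = unanimous).
    sigma = 1.0 - top_count / len(probe_answers)
    if sigma <= low:
        return 1   # unanimous probes: single-model execution suffices
    elif sigma <= high:
        return 2   # mild disagreement: two-model ensemble
    return 3       # high variance: escalate to full three-model ensemble
```

Under this sketch, unanimous probes (`["A", "A", "A"]`) route to a single model, one dissenting probe routes to two models, and full three-way disagreement triggers the full ensemble, which is how a majority of tasks could avoid the cost of three-model execution.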
- Achieved 55.6% accuracy across 1,510 benchmark tasks using Claude, GPT-4o, and Gemini
- Avoided full three-model ensembling on 54.2% of tasks using sigma-based routing
- Documented critical failure modes: retrieval augmentation reduced accuracy by 3.4 percentage points, and the 'agreement-but-wrong' mode creates an accuracy ceiling
Why It Matters
ACAR provides a practical framework for building cost-effective, auditable multi-AI systems while documenting which techniques actually fail in practice.