Research & Papers

ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces

New routing system uses self-consistency variance to avoid costly full ensembling on 54.2% of tasks.

Deep Dive

A new research paper introduces ACAR (Adaptive Complexity and Attribution Routing), a framework for intelligently routing tasks across multiple large language models to optimize performance and cost. Developed by researcher Ramchand Kumaresan and tested on 1,510 tasks across four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA), ACAR uses self-consistency variance (sigma) computed from just three probe samples to determine whether to use single-model, two-model, or three-model execution. The system runs on TEAMLLM, a deterministic execution substrate that provides complete, auditable decision traces for all 7,550 experimental runs.

The results show ACAR achieved 55.6% accuracy using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, outperforming a two-model baseline (54.4%) while avoiding the computational expense of full three-model ensembling on 54.2% of tasks. The research also documents critical negative findings: retrieval augmentation reduced accuracy by 3.4 percentage points due to poor semantic alignment, and the 'agreement-but-wrong' failure mode—where models confidently agree on incorrect answers—creates an intrinsic accuracy ceiling approximately eight percentage points below full ensembling. These findings establish falsifiable baselines for future work on practical multi-model systems that require both efficiency and auditability.

Key Points
  • Achieved 55.6% accuracy across 1,510 benchmark tasks using Claude, GPT-4o, and Gemini
  • Avoided full three-model ensembling on 54.2% of tasks using sigma-based routing
  • Documented critical failure modes: retrieval hurt accuracy by 3.4% and 'agreement-but-wrong' creates accuracy ceiling

Why It Matters

Provides a practical framework for building cost-effective, auditable multi-AI systems while documenting what techniques actually fail in production.