Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search
New research reveals how many searches AI agents really need before costs outweigh accuracy gains.
Researchers Kyle McCleary and James Ghawaly have published a comprehensive study titled "Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search," accepted at the 2026 Language Resources and Evaluation Conference. The paper introduces Budget-Constrained Agentic Search (BCAS), a novel evaluation harness that systematically measures how search depth, retrieval strategy, and token budgets affect the performance and cost of agentic Retrieval-Augmented Generation (RAG) systems. Unlike typical benchmarks, BCAS imposes explicit, real-world constraints on tool calls and completion tokens, forcing AI agents to work within fixed budgets—a critical consideration for production deployments.
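To make those constraints concrete, the sketch below shows what a budget-enforcing agent loop can look like in Python. It is a minimal illustration, not the paper's harness: the `SearchBudget` fields, the `propose_query`/`search`/`answer` callables, and the default limits are all assumed names and values.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SearchBudget:
    """Explicit per-question limits of the kind BCAS imposes.
    Field names and defaults are illustrative, not the paper's API."""
    max_tool_calls: int = 3           # cap on search iterations
    max_completion_tokens: int = 512  # cap on final-answer tokens
    tool_calls_used: int = 0

    def can_search(self) -> bool:
        return self.tool_calls_used < self.max_tool_calls

def budgeted_answer(
    question: str,
    propose_query: Callable[[str, list[str]], Optional[str]],  # agent picks next query, or None to stop
    search: Callable[[str], list[str]],                        # retrieval tool call
    answer: Callable[[str, list[str], int], str],              # final generation step
    budget: SearchBudget,
) -> str:
    """Agentic search loop that hard-stops once the tool-call budget is
    spent, then generates the answer within the completion-token budget."""
    evidence: list[str] = []
    while budget.can_search():
        query = propose_query(question, evidence)
        if query is None:              # agent decides it has enough evidence
            break
        budget.tool_calls_used += 1    # every search is charged to the budget
        evidence += search(query)
    return answer(question, evidence, budget.max_completion_tokens)
```

The point of the hard stop is that the agent cannot trade unlimited tool calls for accuracy; it must decide when the marginal search is worth its cost, which is exactly the trade-off the experiments measure.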
Using BCAS, the team ran controlled experiments across six different large language models (LLMs) and three established question-answering datasets. Their key finding is that while accuracy improves with additional search iterations, it quickly plateaus after just 2-3 searches, meaning further costly API calls yield diminishing returns. The most effective retrieval strategy proved to be a hybrid approach combining lexical (keyword) and dense (semantic) search, followed by a lightweight re-ranking step. This configuration delivered the largest accuracy gains across their ablation tests.
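The summary above does not pin down how the lexical and dense rankings are combined, so the sketch below uses reciprocal rank fusion (RRF), a common way to merge two retriever outputs; the fusion choice, function names, and `k = 60` default are assumptions for illustration, not confirmed details of the authors' setup.

```python
from typing import Callable

def reciprocal_rank_fusion(
    lexical_ranking: list[str],
    dense_ranking: list[str],
    k: int = 60,  # conventional RRF smoothing constant
) -> list[str]:
    """Merge two document-ID rankings by summing 1 / (k + rank) per list,
    so documents ranked highly by either retriever float to the top."""
    scores: dict[str, float] = {}
    for ranking in (lexical_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank_top_n(
    fused: list[str],
    rescore: Callable[[str], float],  # e.g. a small cross-encoder score
    n: int = 10,
) -> list[str]:
    """Lightweight re-ranking: rescore only the top-n fused candidates
    and leave the long tail untouched, which keeps the step cheap."""
    head = sorted(fused[:n], key=rescore, reverse=True)
    return head + fused[n:]
```

Restricting the re-ranker to the fused head is what makes the step "lightweight": only a handful of candidates ever reach the more expensive scorer.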
Furthermore, the study shows that allocating a larger budget for final answer generation (completion tokens) pays off only on complex, multi-hop reasoning tasks such as those in the HotpotQA benchmark; for simpler factual queries, a smaller completion budget suffices. The research concludes with practical, data-backed recommendations for engineers configuring agentic pipelines, emphasizing that optimal performance comes not from maximizing resources but from strategically allocating a limited budget. The accompanying open-source prompts and evaluation settings aim to make these findings reproducible and actionable for the AI community.
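One way to operationalize that recommendation is to key budgets to task complexity. The presets below follow the direction of the paper's findings (more searches and completion tokens for multi-hop questions, less for simple factoids), but the specific numbers are placeholders, not values the authors report.

```python
# Placeholder budget presets: the direction of the split follows the
# paper's findings, but the exact numbers are illustrative assumptions.
BUDGET_PRESETS = {
    "simple_factoid": {"max_tool_calls": 2, "max_completion_tokens": 256},
    "multi_hop":      {"max_tool_calls": 3, "max_completion_tokens": 1024},
}
```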
- Accuracy plateaus after 2-3 search iterations in agentic RAG, making further API calls cost-ineffective.
- Hybrid retrieval (lexical + dense) with lightweight re-ranking produced the best accuracy-cost trade-off in tests.
- Larger completion budgets only improve performance on complex synthesis tasks like HotpotQA, not simple fact retrieval.
Why It Matters
Provides data-driven, cost-optimized blueprints for engineers deploying production AI agents, directly impacting operational budgets and system design.