Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
Say goodbye to opaque RaaS billing — CaaS only charges for relevant chunks.
Retrieval-Augmented Generation (RAG) has become essential for grounding LLMs with external data, but the dominant RAG-as-a-Service (RaaS) model charges per prompt regardless of whether the retrieved chunks are actually relevant. This opaque pricing inflates costs and wastes budgets on low-quality retrievals. A new paper from researchers at multiple institutions introduces Chunk-as-a-Service (CaaS), which flips the model: you only pay for the chunks that are contextually relevant to your query. CaaS comes in two flavors: Open-Budget (OB-CaaS), which enriches every prompt, and Limited-Budget (LB-CaaS), which uses a novel online algorithm to selectively enrich prompts under a fixed budget.
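The pricing difference is easy to see with a toy cost comparison. The prices and per-prompt chunk counts below are made-up illustrations, not figures from the paper:

```python
# Hypothetical prices for illustration only; real RaaS/CaaS pricing varies.
PRICE_PER_PROMPT = 0.010   # RaaS: flat fee per prompt, relevant or not
PRICE_PER_CHUNK = 0.004    # CaaS: fee per relevant chunk delivered

def raas_cost(num_prompts: int) -> float:
    """RaaS bills every prompt, even when retrieval returns nothing useful."""
    return num_prompts * PRICE_PER_PROMPT

def caas_cost(relevant_chunks_per_prompt: list[int]) -> float:
    """CaaS bills only chunks judged relevant; empty retrievals cost nothing."""
    return sum(n * PRICE_PER_CHUNK for n in relevant_chunks_per_prompt)

# Five prompts; retrieval found 0-2 relevant chunks for each.
chunks = [2, 0, 1, 2, 0]
raas = raas_cost(5)        # pays for all five prompts
caas = caas_cost(chunks)   # pays for the five relevant chunks only
```

Under these made-up prices, the two prompts with no relevant chunks cost nothing under CaaS, which is the transparency the model is after.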
The core innovation is the Utility-Cost Online Selection Algorithm (UCOSA), which decides in real time whether to fetch and pay for a chunk based on a utility-cost trade-off. In experiments, UCOSA delivers 52% better performance than random selection (measured as the number of enriched prompts times their average relevance) and reaches about 75% of the performance of an idealized offline selector that knows future queries in advance. More importantly, cost efficiency improves sharply: LB-CaaS achieves 140% and OB-CaaS 86% higher performance-to-budget ratios than standard RaaS. For enterprise teams running high-volume RAG pipelines, CaaS could dramatically lower costs while maintaining, or even improving, answer quality.
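The paper's exact UCOSA rule and threshold schedule aren't reproduced here, but the general shape of a utility-cost online selection under a budget can be sketched. Everything below (the `ChunkOffer` type, the fixed `min_ratio` threshold) is an illustrative assumption, not the paper's algorithm:

```python
from dataclasses import dataclass

@dataclass
class ChunkOffer:
    utility: float  # estimated relevance of the chunk to the current query
    cost: float     # price the service would charge for this chunk

def select_online(offers: list[ChunkOffer], budget: float,
                  min_ratio: float = 1.0) -> tuple[list[int], float]:
    """Greedy online rule: pay for a chunk only if its utility-per-cost
    clears a threshold and the remaining budget covers it. Offers arrive
    one at a time; decisions are irrevocable, as in LB-CaaS."""
    spent = 0.0
    accepted = []
    for i, offer in enumerate(offers):
        affordable = spent + offer.cost <= budget
        worthwhile = offer.utility / offer.cost >= min_ratio
        if affordable and worthwhile:
            accepted.append(i)
            spent += offer.cost
    return accepted, spent

# Three offers arrive in sequence; the low-utility one is skipped.
offers = [ChunkOffer(0.9, 1.0), ChunkOffer(0.2, 1.0), ChunkOffer(0.8, 1.0)]
picked, spent = select_online(offers, budget=2.0, min_ratio=0.5)
```

A fixed threshold like `min_ratio` is the simplest choice; online-knapsack-style algorithms typically adapt the threshold as the budget is consumed, which is closer in spirit to what a budget-aware selector like UCOSA would need to approach the offline optimum.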
- CaaS charges per relevant chunk retrieved instead of per prompt, increasing cost transparency.
- UCOSA algorithm selects which prompts to enrich online, balancing utility and cost under budget limits.
- Compared to RaaS, LB-CaaS achieves 140% and OB-CaaS 86% higher performance-to-budget ratios.
Why It Matters
Makes RAG more affordable for enterprises with tight budgets while maintaining high relevance.