Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
Say goodbye to opaque RaaS billing — CaaS only charges for relevant chunks.
Retrieval-Augmented Generation (RAG) has become essential for grounding LLMs with external data, but the dominant RAG-as-a-Service (RaaS) model charges per prompt regardless of whether the retrieved chunks are actually relevant. This opaque pricing inflates costs and wastes budgets on low-quality retrievals. A new paper from researchers at multiple institutions introduces Chunk-as-a-Service (CaaS), which flips the model: you only pay for the chunks that are contextually relevant to your query. CaaS comes in two flavors: Open-Budget (OB-CaaS), which enriches every prompt, and Limited-Budget (LB-CaaS), which uses a novel online algorithm to selectively enrich prompts under a fixed budget.
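The pricing difference is easy to see with a toy cost comparison. The prices and per-prompt chunk counts below are made-up illustrations, not figures from the paper:

```python
# Hypothetical prices for illustration only; real RaaS/CaaS pricing varies.
PRICE_PER_PROMPT = 0.010   # RaaS: flat fee per prompt, relevant or not
PRICE_PER_CHUNK = 0.004    # CaaS: fee per relevant chunk delivered

def raas_cost(num_prompts: int) -> float:
    """RaaS bills every prompt, even when retrieval returns nothing useful."""
    return num_prompts * PRICE_PER_PROMPT

def caas_cost(relevant_chunks_per_prompt: list[int]) -> float:
    """CaaS bills only chunks judged relevant; empty retrievals cost nothing."""
    return sum(n * PRICE_PER_CHUNK for n in relevant_chunks_per_prompt)

# Five prompts; retrieval found 0-2 relevant chunks for each.
chunks = [2, 0, 1, 2, 0]
raas = raas_cost(5)        # pays for all five prompts
caas = caas_cost(chunks)   # pays for the five relevant chunks only
```

Under these made-up prices, the two prompts with no relevant chunks cost nothing under CaaS, which is the transparency the model is after.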
The core innovation is the Utility-Cost Online Selection Algorithm (UCOSA), which decides in real time whether to fetch and pay for a chunk based on a utility-cost trade-off. In experiments, UCOSA delivers 52% better performance than random selection (measured as the number of enriched prompts times their average relevance) and reaches about 75% of the performance of an idealized offline selector that knows future queries in advance. More importantly, cost efficiency improves sharply: LB-CaaS achieves 140% and OB-CaaS 86% higher performance-to-budget ratios than standard RaaS. For enterprise teams running high-volume RAG pipelines, CaaS could dramatically lower costs while maintaining, or even improving, answer quality.
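The paper's exact UCOSA rule and threshold schedule aren't reproduced here, but the general shape of a utility-cost online selection under a budget can be sketched. Everything below (the `ChunkOffer` type, the fixed `min_ratio` threshold) is an illustrative assumption, not the paper's algorithm:

```python
from dataclasses import dataclass

@dataclass
class ChunkOffer:
    utility: float  # estimated relevance of the chunk to the current query
    cost: float     # price the service would charge for this chunk

def select_online(offers: list[ChunkOffer], budget: float,
                  min_ratio: float = 1.0) -> tuple[list[int], float]:
    """Greedy online rule: pay for a chunk only if its utility-per-cost
    clears a threshold and the remaining budget covers it. Offers arrive
    one at a time; decisions are irrevocable, as in LB-CaaS."""
    spent = 0.0
    accepted = []
    for i, offer in enumerate(offers):
        affordable = spent + offer.cost <= budget
        worthwhile = offer.utility / offer.cost >= min_ratio
        if affordable and worthwhile:
            accepted.append(i)
            spent += offer.cost
    return accepted, spent

# Three offers arrive in sequence; the low-utility one is skipped.
offers = [ChunkOffer(0.9, 1.0), ChunkOffer(0.2, 1.0), ChunkOffer(0.8, 1.0)]
picked, spent = select_online(offers, budget=2.0, min_ratio=0.5)
```

A fixed threshold like `min_ratio` is the simplest choice; online-knapsack-style algorithms typically adapt the threshold as the budget is consumed, which is closer in spirit to what a budget-aware selector like UCOSA would need to approach the offline optimum.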
- CaaS charges per relevant chunk retrieved instead of per prompt, increasing cost transparency.
- UCOSA algorithm selects which prompts to enrich online, balancing utility and cost under budget limits.
- Compared to RaaS, LB-CaaS achieves 140% and OB-CaaS 86% higher performance-to-budget ratios.
Why It Matters
Makes RAG more affordable for enterprises with tight budgets while maintaining high relevance.