DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation
New architecture targets a key RAG weakness: multi-step queries over heterogeneous documents.
A team of researchers including Valeriy Kovalskiy and Nikita Belov has published a new paper on arXiv titled 'DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation.' The work addresses a critical weakness in current RAG (retrieval-augmented generation) systems, which are used to ground large language models in external knowledge. The authors argue that 'Naive RAG' pipelines often fail on heterogeneous corpora (mixed document types) and multi-step queries: because they rely on flat knowledge representations and lack explicit workflows, answer quality degrades.
Their solution is the DCD framework, which imposes a three-level hierarchy (Domain, Collection, Document) on the information space. This design enables controlled, multi-stage query processing in which the system progressively narrows the retrieval and generation scope. Instead of searching a vast, undifferentiated document pool, the system first routes a query to a relevant domain, then to a specific collection within it, and finally to individual documents. The 11-page paper describes an architecture that also includes smart chunking, hybrid retrieval, and integrated validation guardrails.
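To make the Domain, Collection, Document narrowing concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names, the `route` function, and the sample data are all hypothetical, and a toy keyword-overlap score stands in for the model-based routing the paper describes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Collection:
    name: str
    description: str
    documents: list = field(default_factory=list)


@dataclass
class Domain:
    name: str
    description: str
    collections: list = field(default_factory=list)


def tokens(text: str) -> set:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Toy relevance score (an assumption): count of shared distinct words."""
    return len(tokens(query) & tokens(text))


def route(query: str, domains: list, top_k: int = 2) -> dict:
    """Narrow the search scope in three stages instead of scanning every document."""
    # Stage 1: pick the domain whose description best matches the query.
    domain = max(domains, key=lambda d: overlap(query, d.description))
    # Stage 2: pick a collection, looking only inside the chosen domain.
    collection = max(domain.collections, key=lambda c: overlap(query, c.description))
    # Stage 3: rank documents, looking only inside the chosen collection.
    ranked = sorted(
        collection.documents,
        key=lambda doc: overlap(query, doc.text),
        reverse=True,
    )
    return {
        "domain": domain.name,
        "collection": collection.name,
        "documents": [doc.doc_id for doc in ranked[:top_k]],
    }


# Hypothetical sample knowledge base, for illustration only.
DOMAINS = [
    Domain("hr", "employee vacation leave payroll benefits policy", [
        Collection("leave", "vacation sick leave days policy", [
            Document("hr-001", "Employees accrue 20 vacation days per year."),
            Document("hr-002", "Sick leave requires a medical note after three days."),
        ]),
        Collection("payroll", "salary payroll payment schedule", [
            Document("hr-101", "Salary is paid on the last working day of the month."),
        ]),
    ]),
    Domain("it", "laptop vpn password software access", [
        Collection("access", "vpn password reset account access", [
            Document("it-001", "Reset your password through the self-service portal."),
        ]),
    ]),
]

result = route("how many vacation days per year", DOMAINS, top_k=1)
print(result)
```

In a real system each stage would likely call a model or a retriever rather than count words. The point of the sketch is only that each stage searches a smaller scope than the one before it.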
The proposed method represents a shift from treating a knowledge base as a simple 'document dump' to managing it as a structured, navigable space. By adding this layer of orchestration before the LLM generates a final answer, DCD aims to improve the robustness, factual accuracy, and relevance of RAG systems in applied scenarios. The researchers link to a Hugging Face repository and a Git code repository, which suggests implementation resources are available for developers building enterprise AI assistants.
- Proposes a three-tier hierarchical design (Domain-Collection-Document) to structure knowledge for RAG systems, moving beyond flat document storage.
- Uses multi-stage routing based on structured model outputs to progressively restrict retrieval scope, improving accuracy for complex, multi-step queries.
- Architecture includes smart chunking, hybrid retrieval, and validation guardrails, aiming to solve quality degradation in 'Naive RAG' pipelines on heterogeneous data.
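Hybrid retrieval usually means merging results from a keyword search and a vector search. This summary does not say how DCD merges them, so the sketch below uses reciprocal rank fusion (RRF), a common general-purpose choice, purely as an assumption. The function name and document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. k = 60 is a conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword retriever and a vector retriever.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_c", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_c']
```

RRF uses only rank positions, so it needs no calibration between keyword scores and embedding similarities. That is one reason it is a popular default.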
Why It Matters
By giving retrieval an explicit structure and workflow, DCD aims to make enterprise AI assistants more accurate and robust on complex, real-world documents.