Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing, and CRAG
New architecture combines dual retrieval, adaptive routing, and CRAG to cut hallucinations and latency.
Researcher Weixi Lin has published a paper detailing Higress-RAG, an enterprise-centric framework designed to overcome the bottlenecks that keep Retrieval-Augmented Generation (RAG) systems from moving from proof of concept to production. The framework targets three persistent challenges: low retrieval precision on complex queries, high hallucination rates in generation, and latency too high for real-time applications. It proposes a 'Full-Link Optimization' strategy built on the Model Context Protocol (MCP), orchestrating a pipeline that combines adaptive routing, semantic caching, hybrid retrieval, and Corrective RAG (CRAG).
The implementation introduces the Higress-Native Splitter for structure-aware data ingestion and applies Reciprocal Rank Fusion (RRF) to merge the ranked results of dense and sparse retrieval. A standout feature is its Semantic Caching mechanism, which pairs 50ms latency with dynamic thresholding and is central to real-time performance. According to the paper, experimental evaluations on Higress's own technical documentation demonstrate the system's robustness. By optimizing the entire retrieval lifecycle, from pre-retrieval query rewriting to post-retrieval corrective evaluation, Higress-RAG presents a scalable, production-ready architecture aimed at making enterprise AI deployments more reliable and efficient.
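For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard form of the technique: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. It is an illustration and not the paper's code. The function name, the document IDs, and the constant k = 60 (a common default for RRF) are assumptions, since this summary does not state the values Higress-RAG uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) and a sparse (keyword) retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization between the two retrievers, which is one reason it is a popular choice for hybrid search.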
- Architecture built on Model Context Protocol (MCP) with a 'Full-Link Optimization' strategy for end-to-end performance.
- Features 50ms-latency Semantic Caching and dual hybrid retrieval using Reciprocal Rank Fusion (RRF).
- Integrates Corrective RAG (CRAG) to actively reduce hallucination rates in the final generation phase.
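The semantic caching idea in the takeaways above can be sketched as a similarity lookup: a new query reuses a cached answer only when its embedding is close enough to a previously seen one. This is a minimal sketch under stated assumptions. The `SemanticCache` class, the toy two-dimensional embeddings, and the fixed 0.9 threshold are all hypothetical; the paper's dynamic thresholding rule is not described in this summary and is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by query embeddings."""

    def __init__(self, threshold=0.9):
        # Assumption: a fixed threshold stands in for dynamic thresholding.
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the best cached answer at or above the threshold, else None."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05])   # nearly the same direction: cache hit
miss = cache.get([0.0, 1.0])    # orthogonal query: falls through to retrieval
```

A linear scan is used for clarity. A production cache would typically back this lookup with a vector index, and the threshold trades hit rate against the risk of returning an answer to a subtly different question.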
Why It Matters
Provides a blueprint for moving enterprise RAG systems from fragile prototypes to scalable, low-latency production deployments.