New RAG test reveals content density yields as low as 8% for some corpora
A user's tiered chunking experiment shows 32% yield for HubSpot vs 8% for KPMG...
A technical experiment on Reddit tested a tiered, page-role-aware RAG retrieval strategy across three diverse production corpora: Intercom (help center), HubSpot (case studies), and KPMG (positioning prose). The user classified chunks into HIGH, MEDIUM, LOW, and REJECTED tiers based on an operational score. Intercom produced 96 HIGH chunks (31% yield), mostly from help-center articles, while HubSpot delivered 40 HIGH chunks (32% yield) from concrete case studies like “23% increase in ACV.” In contrast, KPMG yielded only 3 HIGH chunks (8% yield) because its entire corpus is dense positioning prose. Despite the thin corpus, semantic retrieval still routed queries correctly: “Family business succession” hit /private-enterprise.html with cosine 0.721, and “ESG and climate risk” hit /our-insights/esg.html with cosine 0.794.
Tier weighting (a 1.20× multiplier on HIGH chunks) meaningfully reshuffled top-k results. On one query, a HIGH chunk with cosine 0.535 was reranked above LOW chunks with cosine above 0.6 (weighted 0.642 vs 0.51-0.59). The key takeaway is that a simple “yield score” (HIGH + MEDIUM chunks / total chunks) serves as useful telemetry for predicting, before generation, which brands will require softer claims and more swap-resistant phrasing. The user questions why most RAG benchmarks assume uniformly substantive source material, an assumption that rarely holds for real-world corpora. For anyone building production RAG systems, the result suggests that corpus-quality awareness can affect answer accuracy and retrieval reliability.
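The yield score described above is simple enough to sketch in a few lines. The following is a minimal illustration, not the user's actual code: the function names, the 0.15 threshold, and the chunk counts are all hypothetical, and only the formula (HIGH + MEDIUM chunks / total chunks) comes from the write-up.

```python
from collections import Counter

def yield_score(tiers):
    """Fraction of chunks labeled HIGH or MEDIUM (0.0 for an empty corpus)."""
    if not tiers:
        return 0.0
    counts = Counter(tiers)
    return (counts["HIGH"] + counts["MEDIUM"]) / len(tiers)

def claim_style(score, threshold=0.15):
    """Pick a generation style from the yield score.

    The threshold is an assumed value for illustration only; the
    source does not state where the cutoff should sit.
    """
    return "hedged" if score < threshold else "standard"

# Invented tier labels for a thin corpus of 50 chunks (not real data).
thin_corpus = ["HIGH"] * 2 + ["MEDIUM"] * 2 + ["LOW"] * 30 + ["REJECTED"] * 16

score = yield_score(thin_corpus)
print(f"yield={score:.2f}, style={claim_style(score)}")
```

Computing this once per corpus at ingestion time would make the signal available before any generation call, which is the use the write-up has in mind.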
- Yield scores were 31% (Intercom), 32% (HubSpot), and just 8% (KPMG), tracking content density.
- Semantic routing on KPMG's thin corpus still achieved cosine scores of 0.656–0.794 for niche queries.
- Tier weighting (HIGH × 1.20) reranked a 0.535-cosine HIGH chunk above 0.6+ LOW chunks, shifting top-k composition.
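The tier-weighted reranking can be sketched as follows. Only the HIGH × 1.20 multiplier is reported in the write-up; the MEDIUM and LOW multipliers, the choice to drop REJECTED chunks, the field names, and the sample cosine values for the LOW chunks are assumptions made for illustration. A LOW multiplier near 0.85 happens to be consistent with the reported weighted range of 0.51-0.59 for chunks with cosine above 0.6, but that figure is a guess.

```python
# Only the HIGH multiplier (1.20) comes from the source; the others are assumed.
TIER_WEIGHT = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}

def weighted_score(chunk):
    """Cosine similarity scaled by the chunk's tier multiplier."""
    return chunk["cosine"] * TIER_WEIGHT.get(chunk["tier"], 1.0)

def rerank(chunks, k=3):
    """Drop REJECTED chunks (an assumption), then sort by weighted score."""
    kept = [c for c in chunks if c["tier"] != "REJECTED"]
    return sorted(kept, key=weighted_score, reverse=True)[:k]

# Hypothetical candidates: one HIGH chunk at the reported 0.535 cosine,
# plus LOW chunks with higher raw cosine (values invented for illustration).
candidates = [
    {"id": "low-a", "tier": "LOW", "cosine": 0.62},
    {"id": "low-b", "tier": "LOW", "cosine": 0.69},
    {"id": "high-a", "tier": "HIGH", "cosine": 0.535},
    {"id": "junk", "tier": "REJECTED", "cosine": 0.80},
]

for chunk in rerank(candidates):
    print(chunk["id"], round(weighted_score(chunk), 3))
```

Under these assumed weights, the HIGH chunk scores 0.535 × 1.20 = 0.642 and moves ahead of both LOW chunks despite its lower raw cosine, which mirrors the reshuffle the user reported.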
Why It Matters
This shows that RAG systems need corpus-quality awareness to avoid overconfident answers on thin content.