Core-based Hierarchies for Efficient GraphRAG
New method replaces Leiden clustering with deterministic k-core decomposition, solving reproducibility issues in sparse knowledge graphs.
Researchers Jakir Hossain and Ahmet Erdem Sarıyüce have published a paper introducing a more efficient framework for GraphRAG (Graph-based Retrieval-Augmented Generation). The core innovation replaces the widely used Leiden clustering algorithm for community detection with k-core decomposition, a deterministic, density-aware method that runs in linear time. This addresses a critical flaw in existing GraphRAG systems: on sparse knowledge graphs, where most nodes have few connections, Leiden's modularity optimization can produce exponentially many near-optimal partitions, making the resulting community structures unstable and non-reproducible. The new method guarantees fully reproducible hierarchies, which is essential for reliable document analysis across domains like finance, news, and podcast transcripts.
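To make the determinism claim concrete, here is a minimal sketch using Python's networkx library (the toy graph and entity names are illustrative, not from the paper). A node's core number is a fixed structural property of the graph, so the decomposition yields the same hierarchy on every run, with no dependence on random seeds or node ordering:

```python
import networkx as nx

# Toy knowledge graph: entities as nodes, extracted relations as edges.
G = nx.Graph([
    ("acme", "q3_earnings"), ("acme", "ceo_jane"), ("ceo_jane", "q3_earnings"),
    ("acme", "supplier_x"), ("supplier_x", "port_strike"),
])

# Core numbers depend only on graph structure, so every run produces the
# exact same hierarchy -- unlike Leiden, whose modularity optimization can
# land on many different near-optimal partitions of a sparse graph.
core = nx.core_number(G)  # {"acme": 2, "q3_earnings": 2, ..., "port_strike": 1}

# The 2-core keeps only the densely connected triangle of entities.
dense = nx.k_core(G, k=2)
print(sorted(dense.nodes))  # ['acme', 'ceo_jane', 'q3_earnings']
```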
The technical approach leverages the k-core hierarchy, together with lightweight heuristics, to build size-bounded, connectivity-preserving communities for retrieval and summarization. A key component is a token-budget-aware sampling strategy that selects which information reaches the LLM under a fixed budget, significantly reducing inference costs. The researchers evaluated the system using three different LLMs for answer generation and five independent LLM judges on real-world datasets. Results consistently showed improvements in answer comprehensiveness and diversity while using fewer tokens, demonstrating the framework's effectiveness for 'global sensemaking': tasks that require reasoning across many documents. This work provides a more robust and cost-effective foundation for enterprise RAG applications that need to analyze large, interconnected document collections.
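The paper's exact community-building heuristics aren't detailed here, but one plausible sketch of the general idea, again in networkx, is to take the connected components of successively denser cores until each community fits a size bound. The `max_size` parameter and the peeling rule below are assumptions for illustration, not the authors' method:

```python
import networkx as nx

def core_communities(G, max_size=50):
    """Split G into connected communities of at most max_size nodes by
    descending into denser k-cores. A hypothetical sketch, not the
    paper's actual heuristics."""
    communities = []

    def split(H, k):
        for nodes in nx.connected_components(H):
            comp = H.subgraph(nodes)
            if comp.number_of_nodes() <= max_size:
                communities.append(set(nodes))
                continue
            denser = nx.k_core(comp, k=k + 1)  # peel to the next denser core
            if denser.number_of_nodes() == 0:
                communities.append(set(nodes))  # nothing denser; accept as-is
                continue
            shell = set(nodes) - set(denser.nodes)
            # The peeled-off fringe may be disconnected; keep each piece whole
            # so communities stay connected subgraphs.
            for piece in nx.connected_components(comp.subgraph(shell)):
                communities.append(set(piece))
            split(denser, k + 1)

    split(G, k=0)
    return communities
```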
- Replaces non-deterministic Leiden clustering with linear-time k-core decomposition for 100% reproducible community hierarchies in sparse graphs.
- Introduces token-budget-aware sampling heuristics that reduce LLM inference costs while maintaining or improving answer quality (see the sketch after this list).
- Validated on financial earnings, news, and podcast data, showing consistent gains in answer comprehensiveness and diversity across multiple LLM judges.
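As a rough illustration of what budget-aware sampling can look like, the sketch below greedily packs the highest-scoring pieces of context into a fixed token budget. The function name, the relevance scores, and the whitespace token count are assumptions for illustration; the paper's heuristic is more sophisticated, and a production system would count tokens with the model's own tokenizer:

```python
def sample_for_budget(scored_chunks, budget_tokens):
    """Greedily select the highest-scoring text chunks that fit within a
    token budget. Hypothetical sketch; whitespace splitting is a crude
    stand-in for a real tokenizer."""
    picked, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())  # approximate token count
        if used + cost <= budget_tokens:
            picked.append(text)
            used += cost
    return picked

# Example: keep the best community summaries that fit in ~120 tokens.
chunks = [(0.9, "Acme Q3 revenue rose 12% on cloud demand ..."),
          (0.7, "Supplier X delays tied to the port strike ..."),
          (0.4, "Background on Acme's founding and history ...")]
context = "\n".join(sample_for_budget(chunks, budget_tokens=120))
```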
Why It Matters
Enables reliable, cost-effective analysis of massive document sets for finance, intelligence, and research, moving beyond simple Q&A to true global sensemaking.