GraphRAG on consumer GPUs: Llama 3.1 tops knowledge graphs, Qwen 2.5 wins on accuracy
Local LLMs under about 7B parameters fail to complete the GraphRAG pipeline; Llama 3.1 extracts 1,172 entities on a single 8GB GPU.
A new study from Peter Fernandes and Ria Kanjilal, published on arXiv (2605.20815), systematically evaluates GraphRAG (graph-based retrieval-augmented generation) for healthcare EHR schema retrieval using locally deployed, open-source LLMs. The authors implemented Microsoft’s GraphRAG pipeline on real-world EHR documentation and benchmarked four models: Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B). Each model ran via Ollama on a single consumer GPU with 8GB of VRAM. The authors measured indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes.
The results vary substantially by model: Llama 3.1 built the richest knowledge graph (1,172 entities), but Qwen 2.5 delivered the best answer quality (3.3/5). Phi-4-mini (3.8B) failed entirely due to structured-output errors, and Mistral exhibited degenerate repetition. From this, the team identifies a practical capacity threshold: models under ~7B parameters could not reliably produce valid structured outputs and therefore could not complete the GraphRAG pipeline. Local retrieval consistently outperformed global summarization in both latency and factual grounding, with reduced hallucination. These findings indicate that GraphRAG is feasible on consumer hardware, while underscoring how much model selection and retrieval design matter for robust deployment in regulated, privacy-sensitive healthcare settings.
- Llama 3.1 (8B) produced the largest knowledge graph with 1,172 entities on a single consumer GPU (8GB VRAM).
- Qwen 2.5 (7B) achieved the highest answer quality score of 3.3/5, while Phi-4-mini (3.8B) failed the pipeline entirely.
- Models under about 7B parameters could not reliably produce valid structured outputs, limiting GraphRAG feasibility to larger local LLMs.
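The structured-output failure mode described above is something a team could screen for before committing to a model. The following is a minimal Python sketch, not the authors' method: it assumes the extraction step is expected to return JSON with an "entities" list, and the schema keys, function names, and sample outputs are all hypothetical illustrations. It only measures what fraction of raw model outputs parse into that assumed shape.

```python
import json

# Hypothetical schema: the keys each extracted entity is assumed to carry.
REQUIRED_ENTITY_KEYS = {"name", "type", "description"}


def parse_entities(raw):
    """Return the list of entity dicts, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or non-JSON output
    if not isinstance(data, dict):
        return None
    entities = data.get("entities")
    if not isinstance(entities, list):
        return None
    for entity in entities:
        # Every entity must be an object with all required keys.
        if not isinstance(entity, dict):
            return None
        if not REQUIRED_ENTITY_KEYS.issubset(entity):
            return None
    return entities


def valid_output_rate(outputs):
    """Fraction of raw outputs that parse into the assumed schema."""
    if not outputs:
        return 0.0
    valid = sum(1 for raw in outputs if parse_entities(raw) is not None)
    return valid / len(outputs)


# Made-up sample outputs, for illustration only.
SAMPLE_OUTPUTS = [
    '{"entities": [{"name": "patient_id", "type": "column", '
    '"description": "Primary key of the patient table"}]}',
    '{"entities": [{"name": "visit"',        # truncated JSON
    "Sure! Here are the entities I found:",  # prose instead of JSON
]

if __name__ == "__main__":
    print(f"valid output rate: {valid_output_rate(SAMPLE_OUTPUTS):.2f}")
```

Running a check like this over a small batch of extraction prompts per candidate model gives a quick signal of whether the model can sustain the indexing stage at all, before spending GPU hours on a full index build.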
Why It Matters
The study indicates that GraphRAG over private healthcare data is viable on inexpensive consumer hardware, but only with models of roughly 7B parameters or more.