Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning
A new Graph RAG method (KET-RAG) boosts Llama 3.1 8B to match Llama 3.3 70B on multi-hop QA while cutting costs roughly 12x.
A breakthrough in retrieval-augmented generation (RAG) systems reveals that the primary bottleneck in multi-hop question answering isn't finding information—it's connecting the dots. Research using Graph RAG (specifically KET-RAG) shows that the correct answer is present in the retrieved context 77-91% of the time, yet 73-84% of wrong answers stem from the model failing at the reasoning step rather than from missing evidence. This finding flips conventional wisdom about RAG limitations and points directly to reasoning as the critical challenge.
Researchers discovered two inference-time techniques that dramatically close the performance gap between small and large models. First, structured chain-of-thought prompting decomposes a complex question into graph query patterns before answering. Second, graph traversal compresses the retrieved context by approximately 60% without additional LLM calls. Together, these methods let smaller models spend their capacity on reasoning rather than on processing massive contexts.
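The two techniques above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the graph, the `compress_context` function, and the prompt template are all hypothetical stand-ins showing the idea of hop-limited traversal (keep only passages reachable from the question's entities) and hop-wise question decomposition.

```python
from collections import deque

# Hypothetical toy knowledge graph: entity -> (neighbor entities, supporting passages).
# KET-RAG's actual graph construction is more involved; this only illustrates
# the traversal-based context compression idea.
GRAPH = {
    "Inception":         (["Christopher Nolan"], ["Inception is a 2010 film."]),
    "Christopher Nolan": (["London"],            ["Christopher Nolan directed Inception."]),
    "London":            ([],                    ["Christopher Nolan was born in London."]),
    "Paris":             ([],                    ["Paris is the capital of France."]),
}

def compress_context(seed_entities, graph, max_hops=2):
    """Breadth-first traversal from the question's entities; passages attached
    to entities beyond max_hops are dropped, shrinking the context."""
    seen = set(seed_entities)
    frontier = deque((e, 0) for e in seed_entities)
    passages = []
    while frontier:
        entity, hops = frontier.popleft()
        neighbors, texts = graph.get(entity, ([], []))
        passages.extend(t for t in texts if t not in passages)
        if hops < max_hops:
            for n in neighbors:
                if n not in seen:
                    seen.add(n)
                    frontier.append((n, hops + 1))
    return passages

# Structured chain-of-thought prompt: decompose the question into one
# sub-query per hop before answering (wording is illustrative only).
PROMPT = """Question: {question}
Step 1 - list the entities and relations the question chains together.
Step 2 - write one sub-query per hop (entity -> relation -> entity).
Step 3 - answer each sub-query from the context below, then combine.
Context:
{context}"""

context = compress_context(["Inception"], GRAPH)
print(PROMPT.format(question="Where was the director of Inception born?",
                    context="\n".join(context)))
```

Here the unrelated Paris passage is pruned because it is unreachable from the question's seed entity, while the two-hop chain (film → director → birthplace) survives — the compressed context contains exactly what the decomposed sub-queries need.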
The results are striking: Llama 3.1 8B with these augmentations matches or exceeds the performance of vanilla Llama 3.3 70B on three standard benchmarks—HotpotQA, MuSiQue, and 2WikiMultiHopQA (tested with 500 questions each). This performance parity comes at roughly 12x lower cost when running on platforms like Groq. The techniques have been validated across different systems including LightRAG, confirming their generalizability beyond a single implementation.
This research, detailed in arXiv paper 2603.14045, represents a significant shift in how we approach efficient AI deployment. By optimizing the reasoning pipeline rather than simply scaling model size, developers can achieve state-of-the-art question answering capabilities with dramatically smaller, faster, and cheaper models—opening up advanced AI applications to a much wider range of use cases and budgets.
Key Takeaways
- Llama 3.1 8B with KET-RAG matches/exceeds vanilla Llama 3.3 70B on multi-hop QA benchmarks
- 73-84% of wrong answers come from reasoning failures, not missing information in context
- Achieves ~12x cost reduction on Groq while maintaining performance parity
Why It Matters
Enables enterprise-grade question answering at consumer hardware prices, dramatically lowering barriers to advanced AI deployment.