Research & Papers

Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

A new calibration technique fuses graph and vector search scores to improve complex question answering accuracy.

Deep Dive

A new research paper by Andre Bacellar tackles a core challenge in advanced retrieval-augmented generation (RAG) systems: how to properly combine scores from fundamentally different search methods. Modern systems for multi-hop question answering often use both dense vector similarity (semantic search) and graph-based relevance signals like Personalized PageRank (PPR) to find evidence. However, these scores come from different distributions and aren't directly comparable, making fusion unstable and suboptimal.

Bacellar's method, called PhaseGraph, frames this as a score calibration problem. It maps both vector and graph scores to a common, unit-free scale using percentile-rank normalization (a probability integral transform, or PIT) before fusion, preserving magnitude information. On established benchmarks like MuSiQue and 2WikiMultiHopQA (used in systems like HippoRAG2), this calibrated fusion led to statistically significant gains in held-out last-hop retrieval accuracy. LastHop@5 improved from 75.1% to 76.5% on MuSiQue (+1.4 percentage points) and from 51.7% to 53.6% on 2WikiMultiHopQA (+1.9 percentage points).
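To make the calibrate-then-fuse idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the equal fusion weight, and the example scores are all hypothetical, and it ranks each score against the current candidate set, whereas a real system might calibrate against a larger reference sample.

```python
from bisect import bisect_right


def percentile_rank(scores):
    """Map raw scores to (0, 1] using their empirical CDF rank.

    Assumption: ranks are computed within the candidate set itself.
    """
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_right counts how many scores are <= s
    return [bisect_right(ordered, s) / n for s in scores]


def fuse_linear(vec_scores, graph_scores, weight=0.5):
    """Linear fusion of calibrated scores (weight is a hypothetical knob)."""
    vec_cal = percentile_rank(vec_scores)
    graph_cal = percentile_rank(graph_scores)
    return [weight * v + (1 - weight) * g for v, g in zip(vec_cal, graph_cal)]


# Made-up scores for four candidate passages
vec = [0.82, 0.79, 0.40, 0.75]          # e.g. cosine similarity
ppr = [0.0004, 0.0310, 0.0002, 0.0009]  # e.g. Personalized PageRank mass

fused = fuse_linear(vec, ppr)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(ranking)
```

After calibration both signals live on the same (0, 1] scale, so the tiny raw PPR values are no longer drowned out by the much larger cosine similarities when the two are averaged.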

The research also included a theory-driven ablation study, showing that percentile-based calibration is more directionally robust than simpler min-max normalization. Interestingly, after calibration, the exact fusion operator (such as linear fusion or Boltzmann weighting) mattered less, suggesting that the key contribution is making the scores commensurable in the first place. This work provides a practical, data-driven solution for developers building more reliable RAG pipelines that need to leverage heterogeneous information sources.
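One plausible intuition for the min-max comparison is sensitivity to outliers. The sketch below is an illustration only, not the paper's ablation: the scores are made up and the helper names are hypothetical. A single dominant PPR value squeezes every other min-max score toward zero, while percentile ranks stay evenly spread.

```python
def min_max(scores):
    """Rescale scores to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]


def percentile_rank(scores):
    """Fraction of scores less than or equal to each score."""
    n = len(scores)
    return [sum(1 for t in scores if t <= s) / n for s in scores]


# Made-up PPR scores: one hub node dominates the distribution
ppr = [0.001, 0.002, 0.003, 0.004, 0.900]

mm = min_max(ppr)
pr = percentile_rank(ppr)

print([round(x, 4) for x in mm])  # non-outliers collapse near 0
print(pr)                         # ranks remain evenly spaced
```

Under min-max, the four non-outlier candidates become nearly indistinguishable, so the graph signal contributes almost nothing to their fused ordering. Percentile calibration keeps their relative order usable regardless of how extreme the top score is.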

Key Points
  • PhaseGraph uses percentile-rank normalization (PIT) to calibrate graph and vector scores before fusion, improving LastHop@5 accuracy by 1.4 percentage points on MuSiQue (75.1% to 76.5%) and 1.9 percentage points on 2WikiMultiHopQA (51.7% to 53.6%).
  • The method addresses the heterogeneous score distribution problem between dense similarity search and graph-based signals like Personalized PageRank (PPR).
  • Ablation studies show percentile calibration is more robust than min-max normalization, and the choice of fusion operator becomes less critical after scores are properly calibrated.

Why It Matters

The work provides a more reliable foundation for complex RAG and agentic AI systems that need to synthesize evidence from multiple, disparate sources to answer difficult questions.