Research & Papers

vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

A new local-first system combines vector and keyword search, improving retrieval accuracy for AI agents by up to 21.4%.

Deep Dive

Researcher Jayson Steffens has introduced vstash, a novel local-first document memory system designed to power LLM agents with more accurate and efficient retrieval. The system's core innovation is its hybrid approach, which intelligently fuses vector similarity search (using sqlite-vec) with full-text keyword matching (using SQLite's FTS5) via an adaptive Reciprocal Rank Fusion (RRF) mechanism. All data resides in a single, portable SQLite file, making it ideal for local, privacy-focused applications. The system demonstrated a median search latency of just 20.9 ms across 50,000 document chunks, showing it is fast enough for real-time agent use.
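
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion over two ranked result lists, one from the vector index and one from the keyword index. The function name, the per-query weights, and the smoothing constant k=60 are illustrative assumptions, not details taken from the vstash paper.

    # Weighted RRF over two ranked id lists (best-first). The weights w_vec
    # and w_kw are the per-query knobs that adaptive RRF would tune; k=60 is
    # the conventional RRF smoothing constant, assumed here for illustration.
    def rrf_fuse(vector_ids, keyword_ids, w_vec=0.5, w_kw=0.5, k=60):
        scores = {}
        for rank, doc_id in enumerate(vector_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_vec / (k + rank)
        for rank, doc_id in enumerate(keyword_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_kw / (k + rank)
        # Highest fused score first.
        return sorted(scores, key=scores.get, reverse=True)

    # Example: ids returned by the two backends for a single query.
    print(rrf_fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"]))
    # d1 and d3 appear in both lists, so they rise to the top of the fused ranking.

The appeal of RRF is that it only needs ranks, not comparable scores, so the cosine distances from sqlite-vec and the BM25 scores from FTS5 never have to be normalized against each other.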

A key breakthrough is the system's ability to self-improve. By analyzing disagreements between its vector-heavy and keyword-heavy search modes across 753 queries, vstash generated 76K training examples without human labels. Fine-tuning a small 33M-parameter BGE model on this data boosted retrieval accuracy (NDCG@10) by up to 19.5% on the NFCorpus benchmark. Adaptive RRF with per-query weighting added a further improvement of up to 21.4% on the ArguAna dataset. Together, these gains let the compact model match or exceed much larger 110M-parameter systems such as ColBERTv2 on several benchmarks.
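
The paper does not spell out its exact mining rule here, but the general pattern of turning ranker disagreement into training triples can be sketched as follows. The heuristic below, which treats a document ranked highly by both modes as the positive and a document favored by only one mode as the hard negative, is a hypothetical illustration rather than vstash's actual procedure.

    # Hypothetical sketch: derive (query, positive, negative) triples from
    # disagreement between the vector-heavy and keyword-heavy rankings.
    def mine_triples(query, vector_ids, keyword_ids, depth=10):
        kw_top = set(keyword_ids[:depth])
        agreed   = [d for d in vector_ids[:depth] if d in kw_top]      # both modes agree
        vec_only = [d for d in vector_ids[:depth] if d not in kw_top]  # disagreement
        return [(query, pos, neg) for pos in agreed for neg in vec_only]

Triples of this shape are exactly what contrastive fine-tuning of a small embedding model such as BGE consumes, which is how label-free disagreement data can translate into the reported NDCG@10 gains.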

The paper also details a production-ready implementation with integrity checks, schema versioning, and ranking diagnostics, validated on over 50,000 judged queries. The fine-tuned embedding model and all code are open-sourced, providing developers with a robust, high-performance substrate for building LLM agents that need reliable, fast access to personal or proprietary knowledge bases without relying on cloud services.
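
For a sense of what such production safeguards can look like in a single-file SQLite store, here is an illustrative startup check using standard SQLite facilities (PRAGMA user_version for a schema version stamp and PRAGMA integrity_check for corruption detection). The expected version number and error handling are assumptions; the paper's exact checks are not reproduced here.

    import sqlite3

    EXPECTED_SCHEMA_VERSION = 3  # hypothetical version stamp for this sketch

    def open_store(path):
        db = sqlite3.connect(path)
        # Schema versioning: refuse to run against a file written by an
        # incompatible schema until a migration has been applied.
        version = db.execute("PRAGMA user_version").fetchone()[0]
        if version != EXPECTED_SCHEMA_VERSION:
            raise RuntimeError(f"schema version {version}, expected {EXPECTED_SCHEMA_VERSION}")
        # Integrity check: SQLite scans the file and returns 'ok' if healthy.
        status = db.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"integrity check failed: {status}")
        return db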

Key Points
  • Hybrid local retrieval combines vector and keyword search in one SQLite file, achieving 20.9 ms median latency for 50K chunks.
  • Self-supervised tuning on 76K disagreement triples improved a 33M-parameter model's accuracy by up to 19.5%, rivaling larger 110M-parameter models.
  • Adaptive fusion with per-query weighting boosted search accuracy (NDCG@10) by up to 21.4% on benchmark datasets versus fixed weights.

Why It Matters

Enables developers to build faster, more accurate, and entirely local LLM agents that can reliably search personal documents and data.