Research & Papers

NAVER's V-SPLADE boosts visual document search NDCG@5 by 13.8pp without query-time neural encoding

Inference-free sparse retrieval that beats same-scale dense models across six benchmarks and more than doubles R@5 on an 18.7M-document corpus.

Deep Dive

As large-scale visual document corpora (arXiv papers, enterprise PDFs) grow, retrieval systems must scale without costly neural encoding at query time. Existing approaches either use VLM-based dense models that require neural inference per query, or rely on OCR/caption-based BM25 with slow text extraction. NAVER and Seoul National University researchers fill this gap with V-SPLADE, a learned sparse retriever that indexes visual documents lexically and serves queries with zero neural encoding. The key innovation is caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions, overcoming the lexical grounding problem common in visual sparse representations.
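To make "zero neural encoding at query time" concrete, here is a minimal sketch of how an inference-free sparse retriever can serve a query from an inverted index. It is not the V-SPLADE implementation: the document weights are made-up values standing in for vectors that a learned encoder would produce offline, the whitespace tokenizer is a simplification, and each query term is assumed to carry a weight of 1. The names `doc_vectors`, `build_inverted_index`, and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical document-side sparse vectors: term -> learned weight.
# In a learned sparse retriever these are produced offline at indexing
# time; the values below are invented for illustration only.
doc_vectors = {
    "paper_a": {"transformer": 1.8, "attention": 1.2, "figure": 0.4},
    "paper_b": {"invoice": 2.1, "total": 0.9, "table": 0.7},
    "paper_c": {"attention": 0.6, "table": 1.1, "benchmark": 1.5},
}


def build_inverted_index(vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in vectors.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query, k=5):
    """Score documents by summing the stored weights of matching query terms.

    No model runs here: the query is only lowercased and split on
    whitespace (an assumed tokenizer), and every query term counts once.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = build_inverted_index(doc_vectors)
results = search(index, "attention table")
```

Because all learned weights live on the document side, query cost here is a few posting-list lookups and additions, which is the property that lets this style of retriever scale to very large corpora.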

V-SPLADE gains 13.8 percentage points in NDCG@5 over same-scale dense baselines and up to 6.3pp over OCR- or caption-based BM25 across six visual-document retrieval benchmarks. On an 18.7M-document corpus, it more than doubles recall at 5 (R@5) compared to dense models, and further improves competing retrievers via score fusion by up to 2.4pp R@5. This makes V-SPLADE the first deployable system for lexically indexing visual documents without neural query encoding, enabling fast, scalable search for applications like academic paper retrieval, enterprise document management, and archival search.
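For readers less familiar with the two metrics quoted above, the following sketch shows one common way to compute R@5 and NDCG@5 for a single query. It assumes linear gain and a log2 rank discount for NDCG (some evaluation toolkits use an exponential gain instead, which is identical for binary labels). The function names and the toy ranking are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevance, k=5):
    """NDCG@k with linear gain and a log2 rank discount.

    `relevance` maps doc_id -> graded gain; missing ids count as 0.
    """
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevance = {"d1": 1, "d2": 1}

r5 = recall_at_k(ranked, relevance.keys())
ndcg5 = ndcg_at_k(ranked, relevance)
```

A gain reported in percentage points is the difference between two such scores, averaged over queries and multiplied by 100; moving from 0.50 to 0.60 NDCG@5, for instance, is a 10pp gain.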

Key Points
  • Inference-free serving: no neural encoding at query time, enabling fast and scalable production deployment
  • Caption-gated token supervision is a training-only signal that uses VLM-generated captions as lexical cues; V-SPLADE improves NDCG@5 by 13.8pp over same-scale dense baselines
  • More than doubles recall at 5 (R@5) on an 18.7M-document corpus compared to same-scale dense models

Why It Matters

Enables production-scale visual document search without costly neural inference, making AI-powered retrieval practical for real-world enterprise and academic use.