NAVER's V-SPLADE lifts visual document retrieval by 13.8pp NDCG@5 with no query-time neural encoding
Inference-free sparse retrieval that beats same-scale dense models across six benchmarks and more than doubles R@5 on an 18.7M-document corpus.
As large-scale visual document corpora (arXiv papers, enterprise PDFs) grow, retrieval systems must scale without costly neural encoding at query time. Existing approaches either use dense models built on vision-language models (VLMs), which require neural inference for every query, or rely on BM25 over OCR text or captions, which depends on slow text extraction. NAVER and Seoul National University researchers fill this gap with V-SPLADE, a learned sparse retriever that indexes visual documents lexically and serves queries with no neural encoding at query time. The key innovation is caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions, overcoming the lexical grounding problem common in visual sparse representations.
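To make "inference-free" serving concrete, here is a minimal sketch of how a lexical index can answer a query without calling a model. It is not V-SPLADE's implementation: the document term weights are made-up stand-ins for what a trained document encoder would produce offline, the function names are hypothetical, and the query side is assumed to be a plain binary bag of words (the paper may weight query terms differently).

```python
from collections import defaultdict

def build_inverted_index(doc_term_weights):
    """Map each vocabulary term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def search(index, query, k=5):
    """Score documents by summing the indexed weights of the query's terms.

    Assumption: the query is tokenized on whitespace and treated as a
    binary bag of words, so no neural model runs at query time.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Hypothetical sparse vectors, standing in for the output of a trained
# document encoder that ran once, offline, at indexing time.
doc_term_weights = {
    "page_1": {"transformer": 2.0, "attention": 1.25, "figure": 0.5},
    "page_2": {"invoice": 2.0, "total": 1.0, "figure": 0.25},
    "page_3": {"attention": 0.75, "retrieval": 1.5},
}

index = build_inverted_index(doc_term_weights)
results = search(index, "attention retrieval")
print(results)
```

All model cost sits in building `doc_term_weights`; at query time the work is a handful of posting-list lookups and additions, which is what lets this style of retriever run on standard inverted-index infrastructure.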
V-SPLADE achieves strong results across six visual-document retrieval benchmarks: +13.8 percentage points (pp) in NDCG@5 over same-scale dense baselines and up to +6.3pp over OCR- or caption-based BM25. On an 18.7M-document corpus, it more than doubles Recall@5 (R@5) compared to dense models, and fusing its scores with those of competing retrievers improves them by up to +2.4pp R@5. This makes V-SPLADE the first deployable system for lexically indexing visual documents without neural query encoding, enabling fast, scalable search for applications like academic paper retrieval, enterprise document management, and archival search.
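For readers unfamiliar with the two metrics, the sketch below shows one common way to compute Recall@k and NDCG@k for a single query. It assumes binary relevance labels and the standard log2 discount; the benchmarks may use graded relevance or a different variant, and the function names are illustrative. Note that a percentage-point gain is an absolute difference: +13.8pp means the score rose by 0.138 on a 0 to 1 scale, not by 13.8% relative.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up ranking for one query: only "b" is retrieved within the top 5.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"b", "f", "z"}

print(recall_at_k(ranked, relevant))
print(ndcg_at_k(ranked, relevant))
```

Recall@5 only asks whether relevant pages made it into the top five, while NDCG@5 also rewards placing them near the top, which is why the two numbers can move differently.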
- Inference-free serving: no neural encoding at query time, enabling fast and scalable production deployment
- Caption-gated token supervision: VLM-generated captions serve as training-only lexical cues, and V-SPLADE gains 13.8pp NDCG@5 over same-scale dense baselines
- More than doubles Recall@5 (R@5) on an 18.7M-document corpus compared to same-scale dense models
Why It Matters
V-SPLADE enables production-scale visual document search without costly neural inference at query time, making AI-powered retrieval practical for real-world enterprise and academic use.