NAVER's V-SPLADE boosts visual document search by 13.8pp without neural encoding

arXiv cs.IR June 01, 2026

⚡ Inference-free sparse retrieval that beats dense models on six benchmarks, with more than 2x recall at scale.

Deep Dive

As large-scale visual document corpora (arXiv papers, enterprise PDFs) grow, retrieval systems must scale without costly neural encoding at query time. Existing approaches either use VLM-based dense models, which require neural inference for every query, or fall back on BM25 over OCR output or captions, which depends on slow text extraction. Researchers at NAVER and Seoul National University address this gap with V-SPLADE, a learned sparse retriever that indexes visual documents lexically and serves queries with no neural encoding. Its key innovation is caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. This counters the weak lexical grounding common in visual sparse representations.
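To make "inference-free serving" concrete, here is a minimal sketch of how a query could be scored against precomputed sparse document weights using only an inverted index. It is an illustration under stated assumptions, not V-SPLADE's actual pipeline: the function names, the whitespace tokenizer, the binary query weights, and the toy `doc_weights` values are all hypothetical. The only idea taken from the article is that the neural work happens offline at indexing time, so the query path is plain token lookup.

```python
from collections import defaultdict


def build_inverted_index(doc_weights):
    """Map each token to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for token, weight in weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query, k=5):
    """Score documents by summing stored weights of matching query tokens."""
    scores = defaultdict(float)
    # Plain whitespace tokenization: no model is called at query time.
    for token in set(query.lower().split()):
        for doc_id, weight in index.get(token, []):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical term weights, standing in for the offline output of a
# learned sparse document encoder (values invented for illustration).
doc_weights = {
    "page_1": {"transformer": 1.8, "attention": 1.2, "figure": 0.3},
    "page_2": {"invoice": 2.1, "total": 0.9, "table": 0.4},
    "page_3": {"attention": 0.7, "table": 1.1},
}

index = build_inverted_index(doc_weights)
results = search(index, "attention table")
for doc_id, score in results:
    print(doc_id, round(score, 2))
```

Because the query side reduces to dictionary lookups and additions, serving cost scales with posting-list length rather than with model size, which is the property the article highlights.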

V-SPLADE reports strong results: +13.8 percentage points (pp) in NDCG@5 over same-scale dense baselines and up to +6.3pp over OCR- or caption-based BM25 across six visual-document retrieval benchmarks. On an 18.7M-document corpus, it more than doubles Recall@5 (R@5) relative to dense models, and fusing its scores with those of competing retrievers improves them by up to +2.4pp R@5. This positions V-SPLADE as the first deployable system for lexically indexing visual documents without neural query encoding, enabling fast, scalable search for applications such as academic paper retrieval, enterprise document management, and archival search.
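For readers less familiar with the two metrics quoted above, the following sketch shows one common way to compute Recall@5 and NDCG@5 for a single query. It assumes linear gains and a log2 rank discount. The benchmarks in the paper may use a different NDCG variant, and the document IDs and relevance labels here are invented for illustration.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, gains, k=5):
    """NDCG@k with linear gains and a log2 rank discount."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
gains = {"d1": 1.0, "d2": 1.0}

r5 = recall_at_k(ranked, set(gains), k=5)
n5 = ndcg_at_k(ranked, gains, k=5)
print(f"R@5={r5:.3f} NDCG@5={n5:.3f}")
```

Both metrics lie between 0 and 1, so a gain of "+13.8pp" means the score rose by 0.138 in absolute terms, not by 13.8% relative to the baseline.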

Key Points

Inference-free serving: no neural encoding at query time, enabling fast and scalable production deployment
Caption-gated token supervision uses VLM-generated captions as lexical cues, improving NDCG@5 by +13.8pp over dense baselines
More than doubles Recall@5 (R@5) on an 18.7M-document corpus compared to same-scale dense models

Why It Matters

Enables production-scale visual document search without costly neural inference at query time, making AI-powered retrieval practical for real-world enterprise and academic use.
