ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

New method uses VLMs to generate query-aware descriptions, improving retrieval accuracy across diverse document formats.

Deep Dive

A research team led by Hao Yang has introduced ReAlign, a novel method for improving visual document retrieval systems. Traditional approaches use Vision-Language Models (VLMs) to encode queries and document pages into a shared embedding space through contrastive training. However, these methods struggle when crucial evidence is scattered across complex document layouts, making it difficult to capture the right cues for effective retrieval.
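To make the contrastive baseline concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, written in plain Python. It is an illustration under stated assumptions, not the authors' code: the function name `info_nce_loss`, the cosine-similarity scoring, and the `temperature` value are all hypothetical choices, and the embeddings are assumed to come from some VLM encoder that is not shown.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query_embs, page_embs, temperature=0.05):
    """Mean cross-entropy where page i is the positive for query i.

    All other pages in the batch act as negatives (an assumed setup;
    the temperature value is illustrative only).
    """
    queries = [normalize(q) for q in query_embs]
    pages = [normalize(p) for p in page_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in pages]
        # Numerically stable log-sum-exp over all candidate pages.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


# Toy 2-D embeddings: each query matches the page at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
pages = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(queries, pages)
misaligned = info_nce_loss(queries, list(reversed(pages)))
```

The loss is small when each query sits closest to its own page and large when the pairing is wrong. The whole page is compressed into one vector here, which is the setting in which scattered evidence is hard to capture.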

ReAlign addresses this by using the reasoning capabilities of advanced VLMs to provide fine-grained supervision. The system first uses a stronger VLM to identify the query-related regions on a page, then generates query-aware descriptions grounded in those specific visual areas. The retriever is then trained to align query and document semantics by matching the ranking distribution induced by the original query to the one induced by its region-focused description.
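The distribution-matching step can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes the ranking distributions are softmaxes over retrieval scores for the same candidate pages, and that the mismatch is measured with a KL divergence whose target is the description-based ranking. The names `alignment_loss`, `query_scores`, and `description_scores` are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw retrieval scores into a ranking distribution over pages."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(query_scores, description_scores, temperature=1.0):
    """Penalize the query ranking for deviating from the description ranking.

    Both score lists are assumed to cover the same candidate pages in the
    same order. The KL direction is an assumption made for this sketch.
    """
    target = softmax(description_scores, temperature)
    predicted = softmax(query_scores, temperature)
    return kl_divergence(target, predicted)


# Hypothetical scores for three candidate pages.
description_scores = [4.0, 1.0, 0.5]  # region-focused description favors page 0
query_scores = [1.2, 1.0, 1.1]        # original query barely separates the pages
loss_before = alignment_loss(query_scores, description_scores)
loss_matched = alignment_loss(description_scores, description_scores)
```

In a real training loop this term would be computed on differentiable scores from the retriever and would likely be combined with a contrastive objective; how the two are weighted is not stated in the summary above.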

Experiments on a range of visually rich document benchmarks show that ReAlign consistently improves retrieval performance on both in-domain and out-of-domain datasets, with relative improvements of up to 2%. The gains carry over to different VLM backbones, which the authors attribute to the method guiding models to focus attention on critical visual cues. All code and datasets are publicly available, making the approach accessible for practical use in document search systems.

Key Points
  • Uses VLMs to identify query-related regions and generate fine-grained descriptions as training supervision
  • Achieves up to 2% relative improvement on visual document retrieval benchmarks
  • Generalizes across different VLM architectures by focusing attention on critical visual cues

Why It Matters

ReAlign enables more accurate search through complex documents such as PDFs and scanned materials, where relevant information is visually scattered across the page.