Research & Papers

AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval

New architecture allows enterprise search engines to find relevant images and video using just text queries.

Deep Dive

A team of researchers including Tony Joseph and Carlos Pareja has introduced AMES (Approximate Multi-modal Enterprise Search), a novel architecture that allows existing enterprise search engines to perform unified searches across text, images, and video. The core innovation is its use of "late interaction retrieval," where different types of data—text tokens, image patches, and video frames—are all embedded into a single, shared representation space using multi-vector encoders. This eliminates the need for separate, modality-specific search logic, letting users find a relevant image or video clip by simply typing a text query.
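To make "late interaction" concrete: instead of collapsing a query and a document into single vectors, each side keeps many vectors (one per token, patch, or frame), and relevance is the sum, over query tokens, of each token's best match in the document. A minimal sketch of that MaxSim scoring, assuming query and document are already embedded as NumPy arrays (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction relevance: for each query token vector, take its
    maximum cosine similarity over all document vectors (text tokens,
    image patches, or video frames), then sum across query tokens."""
    # Normalize rows so dot products become cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_vectors)
    # MaxSim: best-matching document vector per query token, summed.
    return float(sim.max(axis=1).sum())
```

Because every modality lives in the same space, the same scoring function works whether `doc_vecs` came from a text passage, an image, or a video clip.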

AMES employs a practical two-stage pipeline for scalability within production environments. First, it performs a fast, parallel token-level Approximate Nearest Neighbor (ANN) search with a per-document Top-M MaxSim approximation, which quickly narrows the candidate set. Second, it uses an accelerator-optimized exact MaxSim re-ranking stage to score the shortlisted candidates and produce the final results. Crucially, the researchers designed AMES to be backend-agnostic, demonstrating it can be deployed within a production-grade system like Apache Solr "without architectural redesign." The paper reports that AMES achieves competitive ranking performance on the ViDoRe V3 benchmark, demonstrating the viability of fine-grained multimodal search in real-world enterprise settings.
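The two-stage idea can be sketched as follows. This is an illustrative reading of the pipeline, not the paper's implementation: brute-force similarity stands in for a real ANN index, and the Top-M approximation is interpreted here as summing only each document's M strongest per-query-token matches; all names are hypothetical.

```python
import numpy as np

def approx_scores(query_vecs, doc_list, top_m=2):
    """Stage 1 (sketch): cheap per-document Top-M MaxSim approximation --
    sum only the M strongest per-query-token maxima per document."""
    scores = []
    for doc_vecs in doc_list:
        sim = query_vecs @ doc_vecs.T        # token-level similarities
        best = sim.max(axis=1)               # MaxSim per query token
        scores.append(np.sort(best)[-top_m:].sum())  # keep Top-M tokens
    return np.array(scores)

def exact_rerank(query_vecs, doc_list, candidate_ids):
    """Stage 2: exact MaxSim over the shortlisted candidates only."""
    exact = {i: float((query_vecs @ doc_list[i].T).max(axis=1).sum())
             for i in candidate_ids}
    return sorted(exact, key=exact.get, reverse=True)

def search(query_vecs, doc_list, shortlist=10):
    """Prune with the approximation, then re-rank exactly."""
    cand = np.argsort(approx_scores(query_vecs, doc_list))[::-1][:shortlist]
    return exact_rerank(query_vecs, doc_list, [int(i) for i in cand])
```

The design point is that the expensive exact MaxSim runs only over the shortlist, so total cost scales with the shortlist size rather than the full corpus.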

Key Points
  • Unifies search across text, image, and video modalities by embedding them into a shared representation space.
  • Uses a two-stage pipeline: fast approximate search (ANN) followed by exact re-ranking (MaxSim) for scalable performance.
  • Designed as backend-agnostic, enabling integration into existing systems like Solr without a full architectural overhaul.

Why It Matters

Enables enterprises to finally search their vast troves of images and video as easily as they search documents, unlocking new value from unstructured data.