Research & Papers

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

New 7,635-query benchmark shows top AI image search models fail 9% of the time on hard negatives.

Deep Dive

A research team of Rohan Mahadev and five co-authors has published PinPoint, a major new benchmark for evaluating Composed Image Retrieval (CIR) systems, accepted to CVPR 2026. The benchmark addresses significant limitations in existing evaluations by providing 7,635 real-world queries across 23 categories, each with an average of 9.1 correct answers, explicit hard negatives, six instruction paraphrases for robustness testing, and support for multi-image composition (13.4% of queries). This comprehensive dataset, with 329,000 relevance judgments, is designed to test false-positive avoidance, robustness, and multi-image reasoning—capabilities crucial for real-world applications like e-commerce and creative tools.
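Because each query has multiple correct answers (9.1 on average), a ranked-retrieval metric like the mAP@10 the paper reports is the natural evaluation choice. As a rough illustration of how such a score is computed (the paper's exact normalization may differ; this sketch uses one common convention):

```python
from typing import List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k: average the precision at each rank where a relevant item
    appears in the top k, normalized by min(|relevant|, k)."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0


def map_at_k(runs: List[List[str]], truths: List[Set[str]], k: int = 10) -> float:
    """mAP@k: mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, t, k) for r, t in zip(runs, truths)) / len(runs)
```

With multiple correct answers per query, a system is rewarded for ranking *all* of them highly, not just one—which is part of why the reported top score of 28.5% leaves so much headroom.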

Analysis of over 20 methods across four major paradigms using PinPoint uncovered three critical shortcomings: even the best models, achieving an mAP@10 of 28.5%, still retrieve irrelevant hard negatives 9% of the time; performance varies by 25.1% across paraphrased instructions; and multi-image queries see a 40-70% performance drop. To address these issues, the researchers propose a novel, training-free reranking method based on an off-the-shelf Multimodal Large Language Model (MLLM) that can be applied to any existing CIR system to bridge the performance gap. The complete dataset, annotations, and benchmarking code have been publicly released to spur development of more robust and reliable image retrieval AI.
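The appeal of a training-free reranker is that it bolts onto any retriever. The paper's exact prompting and scoring details are not given here, but the general pattern can be sketched as follows; `mllm_relevance` is a hypothetical stand-in for a real MLLM call that judges whether a candidate image satisfies the composed query:

```python
from typing import Callable, List, Tuple

def rerank_with_mllm(
    query: str,
    candidates: List[str],
    retriever_scores: List[float],
    mllm_relevance: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Re-sort the retriever's top-k candidates by an MLLM relevance score.

    Only the head of the ranking is rescored, keeping the number of
    (expensive) MLLM calls small; the tail keeps the retriever's order.
    """
    # Order candidate indices by the base retriever's scores.
    order = sorted(range(len(candidates)),
                   key=lambda i: retriever_scores[i], reverse=True)
    head, tail = order[:top_k], order[top_k:]
    # Rescore the head with the MLLM and re-sort by its judgment.
    rescored = sorted(head,
                      key=lambda i: mllm_relevance(query, candidates[i]),
                      reverse=True)
    return [(candidates[i], retriever_scores[i]) for i in rescored + tail]
```

Because the MLLM only sees a short candidate list rather than the whole index, this kind of second-stage filter is well suited to pushing down the hard-negative rate the benchmark measures.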

Key Points
  • PinPoint benchmark contains 7,635 queries with 329K judgments, exposing flaws in current AI image search.
  • Top models fail on 9% of hard negatives and show 25.1% performance variation across paraphrased instructions.
  • Proposes a training-free MLLM reranking method to improve any existing system, with full dataset released.

Why It Matters

Sets a new standard for evaluating real-world AI image search, pushing for models that are more robust, fair, and reliable for applications like product discovery.