Research & Papers

IN2R: Graph-based intra-modal reasoning fixes noisy cross-modal data

New framework handles mismatched image-text pairs in web data by reasoning over intra-modal neighbors instead of picking a single discrete label.

Deep Dive

Large-scale web-harvested datasets are critical for cross-modal retrieval (e.g., image-text matching), but they often contain noisy correspondence: pairs where the image and text don't actually match. Existing methods try to filter out bad pairs or find substitute labels, but they rely on a single “discrete” proxy, which makes them fragile: if that one proxy is wrong, the error goes straight into the supervision signal.

To overcome this, a team led by Yang Liu proposes IN2R (Intra-modal Neighbor-aware Noise Rectification). Instead of discarding noisy pairs, IN2R leverages the geometric stability within each modality (image/image or text/text) to synthesize a reliable supervision target. It uses a Graph Refiner to reason over neighbors retrieved from a dynamic cross-modal memory, producing a continuous soft prototype that encodes the consensus of the local semantic neighborhood. This rectifies inter-modal misalignment without relying on any single discrete label. The framework achieves state-of-the-art performance on Flickr30K, MS-COCO, and CC152K, and its code and models are publicly available.
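
To make the idea concrete, here is a minimal sketch of a neighbor-weighted soft prototype. It is not the authors' implementation: the learned Graph Refiner is swapped for a fixed softmax weighting over cosine similarities, the memory is a plain list rather than a dynamically updated bank, and every name (soft_prototype, memory, k, temperature) and all toy numbers are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature=0.1):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_prototype(query_img, memory, k=3, temperature=0.1):
    """Build a continuous text-side target for an image from its image neighbors.

    memory: list of (image_embedding, text_embedding) pairs (hypothetical layout).
    """
    # 1. Intra-modal retrieval: rank memory entries by image-image similarity.
    ranked = sorted(memory, key=lambda pair: cosine(query_img, pair[0]), reverse=True)
    neighbors = ranked[:k]
    # 2. Weight each neighbor by its similarity to the query (fixed rule here,
    #    standing in for a learned graph-based refiner).
    weights = softmax([cosine(query_img, img) for img, _ in neighbors], temperature)
    # 3. Aggregate the neighbors' text embeddings into one soft prototype.
    dim = len(neighbors[0][1])
    proto = [
        sum(w * normalize(txt)[d] for w, (_, txt) in zip(weights, neighbors))
        for d in range(dim)
    ]
    return normalize(proto)

# Toy 2-D memory: three pairs from one concept, two from another.
memory = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([0.9, 0.1], [1.0, 0.0]),
    ([0.95, 0.05], [0.8, 0.2]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([0.1, 0.9], [0.1, 0.9]),
]
query_img = [1.0, 0.05]   # image from the first concept
noisy_text = [0.0, 1.0]   # its mismatched caption, from the second concept

proto = soft_prototype(query_img, memory, k=3)
print("similarity of prototype to noisy caption:", cosine(proto, noisy_text))
```

In this toy setup the prototype is built only from captions whose images sit close to the query image, so it lands near the first concept even though the query's own caption belongs to the second. The symmetric case (text neighbors producing an image-side target) would follow the same pattern.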

Key Points
  • IN2R shifts from discrete label selection to synthesizing continuous soft prototypes via graph-based reasoning over intra-modal neighbors.
  • Uses a Graph Refiner and dynamic cross-modal memory to exploit geometric stability within each modality (image/image, text/text).
  • Outperforms state-of-the-art noisy correspondence methods on Flickr30K, MS-COCO, and CC152K datasets.

Why It Matters

Cleaner cross-modal retrieval from noisy web data, enabling more robust AI training pipelines.