InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification
Open-vocabulary phrase-region grounding that actually explains why a person was matched...
Person re-identification (ReID) from natural language descriptions is a critical task for surveillance and security, but most deep learning models remain black boxes. While large vision-language models (VLMs) like CLIP have dramatically improved retrieval accuracy, their decisions are often uninterpretable — you can't tell which part of a description matched which part of an image. Existing interpretability methods rely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary.
Enter InterPartAbility, a new approach from researchers at ÉTS Montreal that performs explicit part-wise matching between text phrases and image regions. The key innovation is a Patch-Phrase Interaction Module (PPIM) — a lightweight, open-vocabulary module that provides concept-level supervision to a standard TI-ReID model. PPIM constrains the CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase (e.g., 'blue jacket', 'red shoes'), yielding grounded explanation maps that show exactly which part of the text matched which part of the person in the image. The method also introduces a quantitative interpretability protocol for TI-ReID, adapting perturbation-based metrics like counterfactual region masking — measuring retrieval degradation when the top-ranked explanatory regions are removed.
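The paper doesn't spell out PPIM's internals in this summary, but the core idea of grounding each part-level phrase to image patches via normalized similarity can be sketched in plain NumPy. This is an illustrative approximation, not the authors' implementation; the function name, temperature value, and softmax normalization are our own assumptions.

```python
import numpy as np

def ground_phrases(patch_emb: np.ndarray, phrase_emb: np.ndarray,
                   temperature: float = 0.07) -> np.ndarray:
    """Softmax-normalized cosine similarity between each phrase and all patches.

    patch_emb:  (num_patches, dim) visual patch embeddings (e.g. from a ViT)
    phrase_emb: (num_phrases, dim) text embeddings of part phrases like 'blue jacket'
    Returns:    (num_phrases, num_patches) grounding maps, each row summing to 1.
    """
    # L2-normalize so dot products become cosine similarities
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    q = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    sim = q @ p.T / temperature                   # (num_phrases, num_patches)
    sim -= sim.max(axis=1, keepdims=True)         # subtract row max for stability
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: 4 patches, 2 phrases, 8-dim embeddings.
# Phrase embeddings are chosen to align exactly with patches 0 and 3,
# so each grounding map should peak on its matching patch.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
phrases = patches[[0, 3]].copy()
maps = ground_phrases(patches, phrases)
print(maps.argmax(axis=1))
```

A low temperature makes the maps spatially concentrated, mirroring the paper's constraint that each phrase activates a compact image region rather than a diffuse attention blur.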
On the challenging CUHK-PEDES and ICFG-PEDES benchmarks, InterPartAbility achieves state-of-the-art interpretability under these metrics while maintaining competitive retrieval accuracy. The code is included in the supplementary materials and will be released publicly. This work moves person ReID beyond simple accuracy benchmarks toward models that can explain their reasoning — a crucial step for deployment in sensitive applications like law enforcement and border security where accountability is paramount.
- New Patch-Phrase Interaction Module (PPIM) enables open-vocabulary, part-level grounding between text descriptions and image regions.
- Introduces a quantitative interpretability protocol using counterfactual region masking to measure retrieval degradation when key regions are removed.
- Achieves state-of-the-art interpretability on CUHK-PEDES and ICFG-PEDES benchmarks while preserving competitive retrieval accuracy.
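The counterfactual masking protocol from the bullets above can be sketched as follows. A toy cosine-similarity scorer stands in for the full retrieval model, and all names are hypothetical; the point is only the mechanic: zero out the patches an explanation ranks highest and measure how much the retrieval score drops.

```python
import numpy as np

def score(image_patches: np.ndarray, text_emb: np.ndarray) -> float:
    """Toy retrieval score: cosine similarity of mean-pooled patches vs. text."""
    img = image_patches.mean(axis=0)
    return float(img @ text_emb / (np.linalg.norm(img) * np.linalg.norm(text_emb)))

def counterfactual_drop(image_patches: np.ndarray, text_emb: np.ndarray,
                        explanation: np.ndarray, k: int = 1) -> float:
    """Mask the k patches the explanation ranks highest; return the score drop.

    explanation: (num_patches,) importance weights over patches.
    A larger drop means the explanation pointed at regions the model relied on.
    """
    base = score(image_patches, text_emb)
    top = np.argsort(explanation)[::-1][:k]       # indices of top-k explanatory patches
    masked = image_patches.copy()
    masked[top] = 0.0                             # counterfactual: remove those regions
    return base - score(masked, text_emb)

# Deterministic toy setup: 6 orthogonal patches, text aligned with patch 2.
patches = np.eye(6, 8)
text = patches[2].copy()
good = np.zeros(6); good[2] = 1.0                 # explanation pointing at patch 2
bad = np.zeros(6);  bad[5] = 1.0                  # explanation pointing at an irrelevant patch
print(counterfactual_drop(patches, text, good) >
      counterfactual_drop(patches, text, bad))    # the faithful explanation drops more
```

Under this metric, an explanation is judged by retrieval degradation rather than visual plausibility, which is what makes the protocol quantitative rather than a qualitative heatmap inspection.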
Why It Matters
Makes AI-powered person search explainable — critical for accountability in surveillance, security, and forensic applications.