CLPIPS: A Personalized Metric for AI-Generated Image Similarity
New metric fine-tuned on human rankings achieves stronger correlation than standard LPIPS for text-to-image workflows.
A team of researchers including Khoi Trinh, Jay Rothenberger, and Anindya Maiti has published a paper on arXiv introducing CLPIPS (Customized Learned Perceptual Image Patch Similarity). This new metric addresses a critical pain point in AI image generation: existing objective metrics like LPIPS and CLIP often fail to align with what humans actually perceive as similar, especially in context-specific creative tasks. CLPIPS is a lightweight, fine-tuned extension of the popular LPIPS metric designed to bridge this perceptual gap.
The core innovation is a fine-tuning process that adapts the metric directly to human judgments. The researchers collected a dataset where participants iteratively regenerated target images and ranked outputs by perceived similarity. Using a margin ranking loss on these human-ranked pairs, they fine-tuned only the combination weights of the LPIPS network's layers, keeping the feature extractors frozen. This efficient approach requires minimal human data but yields significant improvements in alignment.
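The training idea described above can be sketched in a few lines. This is a minimal, illustrative sketch, not the authors' code: it assumes the per-layer distances have already been produced by a frozen LPIPS backbone, and the function names and the margin value are hypothetical choices for demonstration.

```python
def clpips_score(layer_distances, layer_weights):
    """CLPIPS-style score: a weighted sum of per-layer perceptual
    distances. Only `layer_weights` would be trained; the feature
    extractors producing `layer_distances` stay frozen."""
    return sum(w * d for w, d in zip(layer_weights, layer_distances))

def margin_ranking_loss(d_preferred, d_other, margin=0.05):
    """Margin ranking loss on one human-ranked pair.

    d_preferred: metric distance for the image humans judged MORE
                 similar to the target (should be smaller).
    d_other:     distance for the image judged LESS similar.
    The loss is zero once d_preferred beats d_other by at least
    `margin`, and grows linearly otherwise.
    """
    return max(0.0, d_preferred - d_other + margin)
```

For example, if the human-preferred image scores 0.2 and the other 0.4, the loss is zero; if the metric ranks them the wrong way round, the loss is positive and gradient descent would push the layer weights toward the human ordering.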
Evaluation showed CLPIPS achieves stronger correlation and agreement with human rankings than the baseline LPIPS, as measured by Spearman rank correlation and Intraclass Correlation Coefficient (ICC). Rather than chasing absolute performance on standard benchmarks, the work focuses on consistency between metric predictions and human ranks within a workflow. This positions CLPIPS as an adaptive component for human-in-the-loop systems, where it can provide more reliable feedback during iterative prompt refinement with models like Stable Diffusion or DALL-E 3.
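Spearman rank correlation, the headline agreement measure above, compares how two orderings of the same images line up rather than their raw scores. A minimal pure-Python illustration (in practice one would use `scipy.stats.spearmanr`; the helper names here are our own):

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value of 1.0 means the metric orders images exactly as the human raters did, which is the alignment CLPIPS is trained toward.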
- CLPIPS fine-tunes LPIPS layer weights using margin ranking loss on human-ranked image pairs.
- The metric showed stronger Spearman correlation and ICC agreement with human judgments than standard LPIPS.
- It's designed as a lightweight, adaptive feedback tool for human-in-the-loop text-to-image prompt refinement workflows.
Why It Matters
Provides more reliable, human-aligned feedback for artists and designers iteratively refining AI-generated images, reducing guesswork.