Vision-Language Models vs Human: Perceptual Image Quality Assessment
A new study benchmarks six vision-language models against human psychophysical data for image quality.
A new research paper titled "Vision-Language Models vs Human: Perceptual Image Quality Assessment" systematically benchmarks how well current AI models can judge image quality the way a human does. The study, authored by Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, and Brian Deegan, tested six Vision-Language Models (VLMs), four proprietary and two open-weight, against established human psychophysical data. The models were evaluated on three perceptual scales: contrast, colorfulness, and overall preference. The goal was to determine whether these multimodal AI systems, which combine vision and language understanding, could offer a scalable, cost-effective alternative to expensive human studies for tasks such as photo editing, content moderation, and graphic design quality control.
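The paper's prompting protocol is not reproduced here, but as a rough illustration, a per-attribute rating might be elicited from a VLM along the following lines. Everything in this sketch is an assumption: `query_vlm` is a hypothetical placeholder for whatever multimodal API is used, and the 0-100 scale and prompt wording are illustrative, not the authors' method.

```python
import re

def query_vlm(image_path: str, prompt: str) -> str:
    # Placeholder: plug in any multimodal chat API that accepts an image
    # plus a text prompt and returns the model's text reply.
    raise NotImplementedError("connect your VLM client here")

def rate_attribute(image_path: str, attribute: str) -> float:
    # Ask for a single numeric rating of one perceptual attribute
    # (e.g. "contrast", "colorfulness", or "overall preference").
    prompt = (
        f"On a scale from 0 to 100, rate the {attribute} of this image. "
        "Reply with a single number only."
    )
    reply = query_vlm(image_path, prompt)
    match = re.search(r"\d+(\.\d+)?", reply)  # pull the first number from the reply
    if match is None:
        raise ValueError(f"no numeric rating found in reply: {reply!r}")
    return float(match.group())

# Usage: scores = [rate_attribute(p, "colorfulness") for p in image_paths]
```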
The results revealed nuanced, attribute-dependent performance. VLMs aligned remarkably well with human judgments of colorfulness, achieving correlation coefficients (ρ) as high as 0.93, but were significantly weaker at assessing contrast. An analysis of how the models weigh these attributes showed that most VLMs, like the human data, assign more importance to colorfulness than to contrast when forming an overall preference score. A counterintuitive finding was a trade-off between self-consistency and human alignment: the models that gave the most repeatable answers were not necessarily the ones that agreed most with people, suggesting that human-like judgment requires sensitivity to subtle, scene-dependent cues. Furthermore, human-VLM agreement increased with perceptual separability: the models are more reliable when the quality differences between images are large and easy to perceive.
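To make the evaluation metric concrete, the sketch below computes a Spearman rank correlation (the usual reading of ρ) between human and model scores, and probes self-consistency via the spread of repeated ratings. All numbers are invented for illustration and do not come from the study.

```python
from scipy.stats import spearmanr
import statistics

# Invented scores for six images: human psychophysical ratings vs one VLM's ratings.
human_scores = [12.0, 35.5, 48.0, 61.2, 77.9, 90.1]
model_scores = [20.0, 30.0, 55.0, 58.0, 80.0, 95.0]

# Human-model alignment: rank correlation between the two score lists.
rho, p_value = spearmanr(human_scores, model_scores)
print(f"human-model alignment: rho = {rho:.2f} (p = {p_value:.3f})")

# Self-consistency is a separate axis: query the model several times on the
# same image and measure how much its answers vary (invented values below).
repeat_ratings = [78.0, 81.0, 79.5]
print(f"repeatability spread: stdev = {statistics.stdev(repeat_ratings):.2f}")
```

A model can score well on the second measure (low spread) while scoring poorly on the first, which is exactly the trade-off the study reports.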
- VLMs showed near-human accuracy on colorfulness judgments, with correlations up to 0.93.
- Performance was attribute-dependent: the models significantly underperformed humans on contrast assessment.
- The most self-consistent models were not the most human-aligned, revealing a key trade-off in AI perception.
Why It Matters
This research guides developers in building better AI for automated photo editing, content quality filters, and graphic design tools.