Perceptual misalignment of texture representations in convolutional neural networks
A surprising study reveals a major gap between AI vision models and human visual processing.
A team of researchers from SISSA, led by Ludovica de Paolis, has published a groundbreaking paper revealing a critical flaw in how convolutional neural networks (CNNs) perceive visual textures. The study, 'Perceptual misalignment of texture representations in convolutional neural networks,' systematically analyzed a diverse pool of CNNs—models like VGG and ResNet that are foundational to modern computer vision. These models are often benchmarked for their alignment with the mammalian visual system using metrics like Brain-Score. The researchers investigated whether the models' internal texture representations, often captured mathematically by Gram matrices, matched human perceptual judgments.
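The Gram-matrix texture representation mentioned above can be sketched in a few lines. This is a minimal, illustrative version (in the spirit of Gatys-style texture statistics), using a random array as a stand-in for a real CNN activation tensor; it is not the paper's actual pipeline.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a feature map with shape (C, H, W).

    Entry G[i, j] is the spatially averaged co-activation of channels
    i and j. Averaging over space discards spatial layout, which is
    why Gram statistics capture "texture" rather than object shape.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten the spatial dimensions
    return f @ f.T / (h * w)         # (C, C) channel co-activations

# Stand-in for a real activation tensor (e.g. a VGG conv-layer output)
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 16, 16))
G = gram_matrix(fmap)
print(G.shape)  # (8, 8), symmetric by construction
```

Comparing such matrices across images (or against human similarity judgments) is one common way to probe whether a network's texture space resembles perceptual space.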
Surprisingly, the results showed no relationship at all: a CNN's high score on conventional biological-vision benchmarks did not predict its ability to represent textures in a human-like way. This indicates that the mechanisms CNNs learn from training on object-recognition tasks such as ImageNet are fundamentally different from those the human brain uses for texture perception. The authors conclude that human texture perception likely depends on the integration of broader contextual information, a process not captured by standard CNN architectures. This finding challenges the assumption that improving a model's performance on object recognition automatically makes it a better model of human vision.
- Study found no correlation between a CNN's Brain-Score (a measure of biological alignment) and its human-like texture perception.
- Tested a diverse pool of CNNs, including architectures like VGG and ResNet, using Gram matrix-based texture representations.
- Reveals a fundamental gap, suggesting human texture processing uses contextual integration mechanisms absent in standard CNNs.
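The kind of cross-model comparison behind the first bullet can be sketched as follows. The scores here are made-up illustrative numbers, one per hypothetical CNN, and the simple Pearson correlation stands in for whatever statistic the study actually used.

```python
import numpy as np

# Hypothetical per-model scores -- illustrative only, not the paper's data.
# One entry per CNN (e.g. VGG and ResNet variants).
brain_score   = np.array([0.38, 0.41, 0.44, 0.47, 0.52])  # biological alignment
texture_align = np.array([0.61, 0.34, 0.58, 0.29, 0.45])  # human-likeness of
                                                          # Gram-based texture reps

# If texture perception followed biological alignment, r would be strongly
# positive; a value near zero would mirror the study's null result.
r = np.corrcoef(brain_score, texture_align)[0, 1]
print(f"Pearson r = {r:+.2f}")
```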
Why It Matters
This forces a rethink of AI vision benchmarks and could lead to new, more human-aligned model architectures for graphics and medical imaging.