Human-like Object Grouping in Self-supervised Vision Transformers
Vision transformers trained with the self-supervised DINO objective match human object grouping on a behavioral benchmark of more than 1,000 trials.
A research team from Columbia University and MIT has published a groundbreaking study demonstrating that self-supervised vision transformers, particularly those trained with Meta's DINO (self-DIstillation with NO labels) objective, exhibit object perception remarkably similar to humans'. The researchers built a behavioral benchmark that scales classical psychophysics up to more than 1,000 trials: participants judged whether two dots placed on a naturalistic scene fell on the same object or on different objects. They then tested a diverse set of vision models, using simple readouts from each model's representations to predict human reaction times, and found steady improvement across model generations, with both architecture and training objective contributing to alignment.
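As a rough illustration of what such a readout could look like, the sketch below scores a dot pair by the cosine similarity of the ViT patch tokens under the two dots, treating higher similarity as evidence that the dots lie on the same object. The grid size, patch size, feature dimension, and function names here are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

def same_object_readout(patch_feats, dot_a, dot_b, grid_size=14, patch=16):
    """Cosine similarity between the patch embeddings under two probe dots.

    patch_feats: (grid_size*grid_size, d) patch tokens from a ViT backbone.
    dot_a, dot_b: (x, y) pixel coordinates of the two probe dots.
    Higher similarity is read out as evidence the dots lie on the same object.
    """
    def token_index(xy):
        col = min(xy[0] // patch, grid_size - 1)
        row = min(xy[1] // patch, grid_size - 1)
        return row * grid_size + col

    fa = patch_feats[token_index(dot_a)]
    fb = patch_feats[token_index(dot_b)]
    return F.cosine_similarity(fa, fb, dim=0)

# Toy usage with random features standing in for a DINO backbone's patch tokens.
feats = torch.randn(14 * 14, 768)
sim = same_object_readout(feats, dot_a=(40, 60), dot_b=(180, 60))
print(f"same-object evidence: {sim.item():.3f}")
```

A scalar like this can then be thresholded for a same/different prediction or regressed against human reaction times.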
Transformer-based models trained with the DINO self-supervised objective matched human segmentation behavior most closely. The researchers also proposed a novel metric that quantifies the object-centric component of a representation by comparing patch similarity within objects to patch similarity between objects; across all tested models, stronger object-centric structure predicted human behavior more accurately. Crucially, they demonstrated that distilling the Gram matrix (the matrix of similarities across image patches) of a self-supervised model into a supervised transformer improves the supervised model's alignment with human perception, converging with prior findings that Gram anchoring improves DINOv3's feature quality.
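A minimal sketch of such a within-versus-between patch-similarity metric is shown below, assuming per-patch object labels are available from ground-truth masks; the function name and exact normalization are assumptions, not the authors' published code.

```python
import torch
import torch.nn.functional as F

def object_centric_score(patch_feats, labels):
    """Mean within-object minus mean between-object patch cosine similarity.

    patch_feats: (n, d) patch tokens; labels: (n,) integer object id per patch.
    A larger gap means patches on the same object look more alike than
    patches on different objects, i.e. stronger object-centric structure.
    """
    feats = F.normalize(patch_feats, dim=1)
    sims = feats @ feats.T                      # (n, n) Gram matrix of cosines
    same = labels[:, None] == labels[None, :]   # within-object pair mask
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    within = sims[same & off_diag].mean()
    between = sims[~same].mean()
    return (within - between).item()

# Toy usage: two fake "objects" occupying halves of an 8x8 patch grid.
feats = torch.randn(64, 768)
labels = torch.arange(64) // 32
print(f"object-centric score: {object_centric_score(feats, labels):.3f}")
```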
These results provide the first comprehensive evidence that modern self-supervised vision models capture object structure in a behaviorally human-like manner, with Gram matrix structure playing a key role in driving this perceptual alignment. The study bridges computer vision and cognitive science, offering new methods for evaluating AI vision systems against human perceptual benchmarks and insights for developing more human-aligned computer vision models.
- DINOv3 models trained with self-supervised objectives showed the strongest alignment with human object perception on the 1,000+ trial benchmark
- Researchers developed a novel metric showing that stronger object-centric structure in a model predicts human segmentation behavior more accurately
- Gram matrix matching through distillation improves model alignment with human perception, revealing a key mechanism for human-like vision (see the sketch after this list)
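One way such Gram matrix matching could be set up is sketched below: a supervised student's patch-similarity (Gram) matrix is pushed toward that of a frozen self-supervised teacher via an MSE loss. The loss form and variable names are illustrative assumptions rather than the study's exact distillation recipe.

```python
import torch
import torch.nn.functional as F

def gram_matching_loss(student_feats, teacher_feats):
    """MSE between the patch-wise Gram matrices of student and teacher.

    student_feats, teacher_feats: (n, d) patch tokens for the same image.
    Matching the (n, n) patch-similarity structure, rather than the raw
    features, transfers how patches group without tying feature dimensions.
    """
    gs = F.normalize(student_feats, dim=1)
    gt = F.normalize(teacher_feats, dim=1)
    gram_s = gs @ gs.T
    gram_t = gt @ gt.T
    return F.mse_loss(gram_s, gram_t.detach())  # teacher stays frozen

# Toy usage: a supervised ViT's patches nudged toward a DINO teacher's Gram.
student = torch.randn(196, 768, requires_grad=True)
teacher = torch.randn(196, 768)
loss = gram_matching_loss(student, teacher)
loss.backward()
print(f"gram loss: {loss.item():.4f}")
```

Matching the Gram matrix rather than the features themselves is what lets models with different feature dimensions share the same grouping structure.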
Why It Matters
This research provides crucial insights for developing AI vision systems that perceive the world more like humans do, with applications in robotics, autonomous vehicles, and assistive technologies.