Concepts Learned Visually by Infants Can Contribute to Visual Learning and Understanding in AI Models
New research finds AI models learn faster and generalize better when trained with concepts infants use, like animacy and goal attribution.
A new research paper from Shify Treger and Shimon Ullman proposes an approach to training AI vision models that mimics how infants learn. The study incorporates foundational visual concepts that infants acquire early, such as animacy (distinguishing living from non-living entities) and goal attribution (understanding that agents act with purpose). The researchers modeled how these "early-acquired concepts" can serve as a scaffold for learning more complex visual tasks, such as predicting future events in dynamic scenes involving human-object interactions.
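The scaffolding idea can be pictured concretely. Below is a minimal sketch, not the authors' actual architecture: it assumes a PyTorch setup in which frozen, pre-trained "early concept" heads emit per-frame animacy and goal-attribution scores that are concatenated with learned visual features before a recurrent head predicts the outcome of a dynamic scene. All module names and dimensions are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's published code): early-concept
# signals are appended to learned per-frame features as a scaffold for
# an event-prediction head.
import torch
import torch.nn as nn

class ConceptScaffoldedPredictor(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256, num_outcomes=4):
        super().__init__()
        # Generic per-frame visual encoder (stand-in for any CNN backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Assumed pre-trained concept heads, kept frozen: they emit scalar
        # animacy and goal-attribution scores for each frame.
        self.animacy_head = nn.Linear(feat_dim, 1)
        self.goal_head = nn.Linear(feat_dim, 1)
        for head in (self.animacy_head, self.goal_head):
            for p in head.parameters():
                p.requires_grad = False
        # Temporal model over [visual features ++ concept scores].
        self.gru = nn.GRU(feat_dim + 2, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_outcomes)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        concepts = torch.cat(
            [torch.sigmoid(self.animacy_head(feats)),
             torch.sigmoid(self.goal_head(feats))], dim=-1)
        # Early concepts act as a scaffold: appended to the learned
        # features rather than re-learned from scratch.
        _, h = self.gru(torch.cat([feats, concepts], dim=-1))
        return self.classifier(h.squeeze(0))  # logits over future outcomes

model = ConceptScaffoldedPredictor()
clip = torch.randn(2, 8, 3, 64, 64)  # two 8-frame clips
print(model(clip).shape)  # torch.Size([2, 4])
```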
Their model was compared against standard deep network approaches, and the differences were marked: the infant-inspired model achieved higher accuracy on the prediction tasks and learned more efficiently, requiring substantially less training data to reach the same level of competence. The combination of early and newly learned concepts also shaped the model's internal representations, improving generalization to unseen data. The team additionally conducted a human study and evaluated advanced vision-language models (VLMs) on a task requiring understanding of animate versus inanimate agent behavior; both sets of results supported the core hypothesis.
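As a rough illustration of how such a data-efficiency comparison might be run (the paper does not publish this protocol as code), the sketch below trains a model on growing fractions of a training set and records held-out accuracy. The datasets, batch sizes, and the brief training loop are placeholder assumptions.

```python
# Sketch of a sample-efficiency comparison: train on growing data
# fractions and track held-out accuracy. Datasets and the model factory
# are placeholders; only the evaluation protocol is illustrated.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

def train_briefly(model, loader, epochs=5, lr=1e-3):
    # Placeholder training loop: cross-entropy on (clip, label) batches.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            opt.zero_grad()
            loss_fn(model(clips), labels).backward()
            opt.step()

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for clips, labels in loader:
        correct += (model(clips).argmax(-1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def sample_efficiency_curve(make_model, train_set, test_set,
                            fractions=(0.1, 0.25, 0.5, 1.0)):
    """Held-out accuracy after training on each fraction of the data."""
    test_loader = DataLoader(test_set, batch_size=32)
    curve = []
    for frac in fractions:
        n = int(frac * len(train_set))
        model = make_model()  # e.g. with or without the concept scaffold
        train_briefly(model, DataLoader(Subset(train_set, range(n)),
                                        batch_size=32, shuffle=True))
        curve.append((n, accuracy(model, test_loader)))
    return curve
```

Running the curve once with a plain baseline factory and once with a concept-scaffolded factory yields the kind of comparison the takeaways below summarize.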
This work, detailed in the arXiv preprint 2503.03361, extends the authors' earlier research with new experiments and broader evaluations. It challenges the prevailing paradigm of training AI purely on massive, unstructured datasets, arguing instead for building in developmental priors, the core conceptual building blocks that can guide and accelerate learning, much as they do in human cognition. The findings open a promising research direction for creating more robust, data-efficient, and interpretable computer vision systems.
- AI models given infant-like concepts (animacy, goal attribution) achieved higher prediction accuracy and required less training data than standard deep networks.
- The approach improved the model's ability to predict future events in dynamic visual scenes and generalize to new situations.
- Comparative tests with advanced vision-language models and a human study validated the contribution of these early concepts to visual understanding.
Why It Matters
This research could lead to AI that learns vision more like humans—faster, with less data, and with deeper, more causal understanding of the world.