Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

A new 2D grid-based benchmark shows that GPT-4o and Claude 3.5 struggle with basic cognitive tasks designed for children.

Deep Dive

Hengwei Ye and nine collaborators have introduced KidGym, a benchmark that adapts principles from the Wechsler Intelligence Scales (standardized tests for evaluating children's cognitive development) to assess Multimodal Large Language Models (MLLMs). The benchmark comprises 12 distinct tasks set in a 2D grid environment and evaluates five core capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. Because object layouts are randomly generated and scenarios are varied, a model cannot succeed by memorizing a fixed configuration, so KidGym measures adaptability and developmental potential in a way that mirrors stages of human cognitive growth.
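To make the idea of randomly generated object layouts concrete, here is a minimal Python sketch of how a seeded random layout on a 2D grid could be produced and rendered. This is an illustration only, not KidGym's actual code or API: the function names `random_layout` and `render`, the object names, and the grid dimensions are all hypothetical.

```python
import random


def random_layout(width, height, objects, seed=None):
    """Place each object on a distinct random cell of a width x height grid.

    Hypothetical helper for illustration; not part of KidGym.
    """
    if len(objects) > width * height:
        raise ValueError("more objects than grid cells")
    rng = random.Random(seed)  # seeded so a layout can be reproduced
    cells = [(x, y) for y in range(height) for x in range(width)]
    positions = rng.sample(cells, len(objects))  # sampling without replacement
    return dict(zip(objects, positions))


def render(width, height, layout):
    """Draw the grid as text: '.' for empty cells, first letter for objects."""
    by_cell = {pos: name[0].upper() for name, pos in layout.items()}
    return "\n".join(
        "".join(by_cell.get((x, y), ".") for x in range(width))
        for y in range(height)
    )


# Assumed example values: a 5x4 grid with three made-up objects.
layout = random_layout(5, 4, ["agent", "key", "box"], seed=0)
print(render(5, 4, layout))
```

In a sketch like this, changing the seed yields a fresh layout for the same task, while enlarging the grid or adding objects is one plausible way to raise difficulty.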

When applied to state-of-the-art models such as GPT-4o and Claude 3.5, KidGym revealed surprising deficiencies: models that excel at processing text and images struggled with fundamental reasoning tasks that children typically master. The benchmark is designed to be fully customizable and extensible, so researchers can create new evaluation scenarios and adjust difficulty levels as models evolve. The paper, accepted at ICLR 2026, highlights that although MLLMs aim for general, human-like competence, significant gaps remain in their basic cognitive and planning abilities.

Key Points
  • KidGym evaluates MLLMs across 5 cognitive capabilities using 12 tasks in a 2D grid system
  • Benchmark inspired by children's intelligence tests (Wechsler Scales) reveals models like GPT-4o struggle with basic reasoning
  • Fully customizable framework allows researchers to create new scenarios and adjust difficulty for evolving models

Why It Matters

By exposing fundamental gaps in MLLM reasoning, KidGym pushes development toward more human-like cognitive abilities beyond pattern recognition.