Research & Papers

[R] Evaluating MLLMs with Child-Inspired Cognitive Tasks

New benchmark reveals AI models struggle with abstract reasoning and multi-rule coordination despite strong single-task performance.

Deep Dive

A research team has introduced KidGym, a novel benchmark for evaluating Multimodal Large Language Models (MLLMs) that moves beyond static assessments to continuous, interactive testing. Inspired by the Wechsler Intelligence Scale for Children (WISC), KidGym organizes evaluation into five cognitive dimensions: Execution, Memory, Learning, Planning, and Perception Reasoning. The benchmark features 12 task categories across three difficulty levels, with randomized layouts and diverse scenarios designed to test both single abilities and compositional reasoning while minimizing data leakage. Built as an interactive 2D grid world with a gym-style API, it includes LLM-friendly features such as a backpack system, a hint panel, and high-level actions that ease community adoption and extension.
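
The snippet below is a minimal, self-contained sketch of the interaction pattern this design implies: a gym-style reset()/step() loop over a randomized 2D grid, a backpack of collected items, a text hint panel, and high-level text actions. All class, action, and observation names here are illustrative assumptions, not KidGym's actual API.

    import random

    class ToyGridEnv:
        # High-level text actions, mirroring the "LLM-friendly" action design.
        ACTIONS = ["move_up", "move_down", "move_left", "move_right", "pick_up"]

        def __init__(self, size=5, seed=None):
            self.size = size
            self.rng = random.Random(seed)

        def reset(self):
            # Randomized layout: agent and item start at random cells each episode.
            self.agent = [self.rng.randrange(self.size), self.rng.randrange(self.size)]
            self.item = [self.rng.randrange(self.size), self.rng.randrange(self.size)]
            self.backpack = []  # items the agent has collected so far
            return self._obs(), {}

        def step(self, action):
            dr, dc = {"move_up": (-1, 0), "move_down": (1, 0),
                      "move_left": (0, -1), "move_right": (0, 1)}.get(action, (0, 0))
            self.agent[0] = min(max(self.agent[0] + dr, 0), self.size - 1)
            self.agent[1] = min(max(self.agent[1] + dc, 0), self.size - 1)
            if action == "pick_up" and self.agent == self.item:
                self.backpack.append("key")
            terminated = "key" in self.backpack
            return self._obs(), float(terminated), terminated, False, {}

        def _obs(self):
            # KidGym observations would include a rendered grid image; a text
            # hint panel and the backpack contents stand in for the idea here.
            return {"agent": tuple(self.agent),
                    "hint": "Pick up the key to finish the task.",
                    "backpack": list(self.backpack)}

    env = ToyGridEnv(seed=0)
    obs, info = env.reset()
    for _ in range(200):  # cap episode length
        # An MLLM policy would choose among ACTIONS from the observation;
        # a random stand-in policy is used here.
        action = env.rng.choice(ToyGridEnv.ACTIONS)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated:
            break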

The benchmark's findings reveal significant gaps in current MLLM capabilities. While strong models perform well on isolated, single-ability tasks, their performance drops noticeably when faced with challenges requiring abstract or non-semantic visual reasoning, numerical sensitivity and counting, and multi-rule coordination across different cognitive abilities. These weaknesses highlight how current benchmarks may overestimate model capabilities by testing skills in isolation rather than in the complex, compositional ways humans use them. KidGym's five-dimensional capability radar charts provide more interpretable, fine-grained evaluation of where models succeed and fail in interactive settings.
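
As a rough illustration, the sketch below shows how per-episode outcomes might be aggregated into the five per-dimension scores that such a radar chart plots. The result records are hypothetical placeholders; only the dimension names come from the benchmark.

    from collections import defaultdict

    DIMENSIONS = ["Execution", "Memory", "Learning", "Planning", "Perception Reasoning"]

    # Hypothetical episode records: (dimension, success flag).
    results = [("Execution", True), ("Execution", False),
               ("Memory", True), ("Planning", False),
               ("Perception Reasoning", True), ("Learning", True)]

    totals, wins = defaultdict(int), defaultdict(int)
    for dim, success in results:
        totals[dim] += 1
        wins[dim] += int(success)

    # Per-dimension success rates: the five values a capability radar chart plots.
    profile = {d: wins[d] / totals[d] if totals[d] else 0.0 for d in DIMENSIONS}
    print(profile)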

Accepted for presentation at ICLR 2026, KidGym represents a shift toward more faithful evaluation of AI systems in continuous interaction scenarios. The benchmark's design emphasizes generalization beyond memorization through randomized elements, and its open-source nature encourages the research community to build upon it. This work suggests that achieving true compositional reasoning in AI will require moving beyond current benchmark paradigms to test how models integrate multiple cognitive skills in dynamic environments.

Key Points
  • KidGym tests MLLMs across 5 cognitive dimensions (Execution, Memory, Learning, Planning, Perception Reasoning) with 12 task categories and 3 difficulty levels
  • Benchmark reveals models struggle with abstract visual reasoning, numerical counting, and multi-rule coordination despite strong single-task performance
  • Features gym-style API, randomized layouts, and LLM-friendly interaction design for community customization and extension

Why It Matters

Provides more realistic evaluation of AI reasoning capabilities, revealing critical weaknesses in compositional thinking that current benchmarks miss.