Research & Papers

VLATIM benchmark: VLMs plan like humans but fail at precise clicking in puzzle games

A new benchmark tests VLMs on The Incredible Machine 2, and the results show a gap between reasoning and execution.

Deep Dive

A new research paper from Dominik Helfenstein, Marco Menner, and Maximilian Triebel introduces VLATIM (Vision-Language Against The Incredible Machine), a benchmark designed to evaluate how well vision-language models (VLMs) perform human-like logical problem-solving in point-and-click puzzle games. The benchmark uses The Incredible Machine 2, a classic physics puzzle game that requires players to chain together objects like levers, ropes, and balls to achieve a goal. Unlike existing benchmarks that focus on discrete tasks, VLATIM targets the gap between high-level reasoning and acting in a continuous action space, where solving a puzzle requires precise mouse interactions.

The benchmark is structured into five progressive parts, ranging from basic visual grounding (identifying objects and their properties) to full puzzle solving with multi-step manipulation. The results reveal a significant disparity: large proprietary models (like those from OpenAI and Google) demonstrated strong planning abilities and could propose logical sequences of actions. However, they consistently failed at precise visual grounding, such as clicking on the exact pixel location of a small lever or correctly positioning an object. None of the tested models achieved human-level problem-solving, which underlines that while VLMs can reason about physics, they cannot yet execute that reasoning in a fine-grained interactive environment. This work highlights a critical bottleneck for deploying VLMs in robotics, gaming, and any domain requiring precise physical interaction.
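To make the grounding failure concrete, the following is a minimal sketch of how a predicted click could be scored against a target region. It is not VLATIM's actual scoring method, which this summary does not describe; the `Box` class, the `click_accuracy` function, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in screen pixels (hypothetical format)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Edges are treated as inside the target.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(clicks, targets):
    """Return the fraction of predicted clicks that land inside their target box."""
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not clicks:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(clicks, targets))
    return hits / len(clicks)


# Made-up example: a small lever occupying a 12x8 pixel region.
lever = Box(left=300, top=200, right=312, bottom=208)

# Three hypothetical model clicks aimed at the lever.
# The second one is 8 pixels to the right of the box and counts as a miss.
predicted = [(305, 204), (320, 204), (300, 200)]

print(click_accuracy(predicted, [lever] * 3))
```

Under a hit-or-miss criterion like this one, a plan can be logically sound and still fail if a single click lands a few pixels outside a small target.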

Key Points
  • The VLATIM benchmark has five progressive parts, ranging from basic visual grounding to full puzzle solving in The Incredible Machine 2.
  • Large proprietary VLMs show strong high-level planning but struggle with precise visual grounding and mouse interactions.
  • None of the tested VLMs achieved human-level problem-solving in this physics puzzle environment.

Why It Matters

Highlights a critical gap between AI reasoning and physical interaction, crucial for robotics and gaming.