Research & Papers

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

New visual prompting technique overlays relationship graphs on images, helping models understand relations such as 'left of' and 'behind'.

Deep Dive

A research team from the University of Bologna, including Giacomo Frisoni, has introduced Graph-of-Mark (GoM), a novel visual prompting technique designed to enhance the spatial reasoning of multimodal language models (MLMs). Unlike earlier methods such as Set-of-Mark, which simply tag objects with numbered boxes, GoM overlays a complete scene graph onto the input image. The graph visually encodes relationships between objects (e.g., connections and hierarchies), giving the model a structured, pixel-level map of the scene's spatial layout. The technique is training-free: it works with existing, off-the-shelf models without any fine-tuning.
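To make the idea concrete, here is a minimal sketch of such an overlay in Python with Pillow. It assumes object boxes and relation labels have already been produced by some upstream detector and scene-graph predictor; the function name, coordinates, colors, and labels below are illustrative assumptions, not the paper's exact rendering pipeline.

```python
from PIL import Image, ImageDraw

def overlay_scene_graph(image_path, objects, relations, out_path):
    """Sketch of a Graph-of-Mark-style overlay: numbered boxes for
    objects, labeled edges between box centers for relations.
    (Hypothetical helper; the paper's drawing choices may differ.)"""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)

    centers = {}
    for obj_id, (x0, y0, x1, y1) in objects.items():
        # Mark each object with a box and its numeric ID.
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(obj_id), fill="red")
        centers[obj_id] = ((x0 + x1) // 2, (y0 + y1) // 2)

    for src, label, dst in relations:
        # Draw a labeled edge between the two object centers,
        # turning the scene graph into pixel-level structure.
        p, q = centers[src], centers[dst]
        draw.line([p, q], fill="yellow", width=2)
        mid = ((p[0] + q[0]) // 2, (p[1] + q[1]) // 2)
        draw.text(mid, label, fill="yellow")

    img.save(out_path)

# Example with made-up boxes: object 1 sits to the left of object 2.
objects = {1: (40, 60, 200, 300), 2: (260, 80, 420, 310)}
relations = [(1, "left of", 2)]
overlay_scene_graph("scene.jpg", objects, relations, "scene_gom.jpg")
```

The annotated image, rather than the raw one, is then what gets passed to the off-the-shelf model, which is why no fine-tuning is needed.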

The researchers rigorously evaluated GoM across three open-source MLMs and four benchmark datasets covering visual question answering and object localization. Extensive ablation studies tested which graph components to draw and whether to include auxiliary textual descriptions. The results were significant: GoM consistently boosted the models' zero-shot ability to interpret object positions and relative directions (like "left of" or "behind"), improving base accuracy by up to 11 percentage points. This advancement directly addresses a key weakness of current vision-language models, which often struggle to understand the relational geometry between objects in a complex scene.
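A rough sketch of how such a zero-shot comparison could be scored is below. The `model.answer` interface and the `annotate` callable (e.g., a wrapper around the overlay sketch above) are hypothetical stand-ins, not the authors' evaluation harness.

```python
def evaluate_spatial_vqa(model, dataset, annotate=None):
    """Zero-shot exact-match accuracy of an MLM on spatial VQA,
    optionally applying a GoM-style overlay to each image first.
    `model.answer(image, question)` is a hypothetical MLM call."""
    correct = 0
    for image, question, gold in dataset:
        if annotate is not None:
            image = annotate(image)  # draw the scene-graph overlay
        prediction = model.answer(image, question)
        correct += int(prediction.strip().lower() == gold.strip().lower())
    return correct / len(dataset)

# Comparing the same frozen model with and without the overlay
# isolates the effect of the visual prompt itself:
# base = evaluate_spatial_vqa(mlm, spatial_qa)
# gom  = evaluate_spatial_vqa(mlm, spatial_qa, annotate=add_gom_overlay)
# print(f"GoM gain: {(gom - base) * 100:.1f} percentage points")
```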

Key Points
  • GoM overlays a scene graph on images, showing object relationships, unlike simple box annotations.
  • Tested on 3 MLMs and 4 datasets, it improved zero-shot spatial reasoning accuracy by up to 11 percentage points.
  • The technique is training-free, providing an immediate upgrade to existing multimodal models' capabilities.

Why It Matters

Enables AI to better understand real-world scenes for robotics, autonomous systems, and advanced image analysis without costly retraining.