Research & Papers

From Instructions to Assistance: A Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs

A new benchmark uses 3D furniture assembly videos and manuals to stress-test multimodal AI assistants.

Deep Dive

A team of researchers from the University of Milan and other institutions has introduced a novel benchmark called the Manual to Action Dataset (M2AD). The dataset is designed to rigorously evaluate whether Multimodal Large Language Models (MLLMs) such as GPT-4V or Claude 3 can act as real-time assistants for complex, procedural tasks. M2AD aligns 3D furniture assembly videos with the corresponding pages of instruction manuals, creating a structured test of whether an AI can understand a sequence of steps, track progress, and correctly reference technical documentation.
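To make the video-to-manual alignment concrete, here is a minimal Python sketch of how one aligned record might be represented. The field names and types are illustrative assumptions, not M2AD's published schema.

```python
from dataclasses import dataclass

@dataclass
class AssemblyStep:
    """One aligned unit: a video segment paired with its manual page.

    All field names here are hypothetical; the real M2AD schema may differ.
    """
    video_id: str       # identifier of the assembly video
    start_sec: float    # start of the video segment showing this step
    end_sec: float      # end of the video segment showing this step
    step_index: int     # position of the step in the assembly sequence
    manual_page: int    # manual page that documents this step

# A tiny illustrative sequence for one piece of furniture.
steps = [
    AssemblyStep("chair_001", 0.0, 14.5, step_index=1, manual_page=2),
    AssemblyStep("chair_001", 14.5, 41.0, step_index=2, manual_page=3),
    AssemblyStep("chair_001", 41.0, 73.2, step_index=3, manual_page=3),
]
```

Note that consecutive steps can point at the same manual page, which is part of what makes page referencing nontrivial for a model.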

The study used M2AD to assess three core capabilities: whether MLLMs can reduce the need for expensive, detailed human annotation; whether they can accurately track the progression of assembly steps; and whether they can refer to the correct manual pages. The findings revealed significant limitations in current openly available models. While some demonstrated an understanding of procedural sequences, their performance was hampered by architectural constraints, particularly in handling multiple images and the interleaved text-and-image reasoning required for dynamic, step-by-step tasks.
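To illustrate how the step-tracking and page-referencing tests could be scored, the following sketch loops over aligned records like those defined above and compares a model's predictions against ground truth. The `model.predict(images, prompt)` and `sample_frames(...)` interfaces are assumptions for illustration, not M2AD's official evaluation protocol.

```python
import re

def parse_answer(text):
    """Extract 'step=<int>, page=<int>' from a model's free-form reply."""
    step = re.search(r"step\s*=\s*(\d+)", text)
    page = re.search(r"page\s*=\s*(\d+)", text)
    return (int(step.group(1)) if step else -1,
            int(page.group(1)) if page else -1)

def evaluate_episode(model, steps, sample_frames):
    """Score step tracking and manual-page referencing over one assembly.

    `model` and `sample_frames` are assumed, hypothetical interfaces:
    the model takes a list of frames plus a text prompt, and
    sample_frames extracts frames from a video segment.
    """
    step_hits = page_hits = 0
    for step in steps:
        frames = sample_frames(step.video_id, step.start_sec, step.end_sec)
        prompt = ("Which assembly step is shown in these frames, and which "
                  "manual page covers it? Answer as: step=<int>, page=<int>")
        pred_step, pred_page = parse_answer(model.predict(frames, prompt))
        step_hits += pred_step == step.step_index
        page_hits += pred_page == step.manual_page
    n = len(steps)
    return {"step_accuracy": step_hits / n, "page_accuracy": page_hits / n}
```

Even this toy loop makes the architectural constraint visible: each query must carry several frames plus text in a single multimodal context, which is exactly where the paper reports current models struggle.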

This work underscores a critical gap between the promise of AI assistants for technical support and their current practical abilities. For MLLMs to be truly useful in Augmented Reality (AR) or Virtual Reality (VR) scenarios, where a user needs guidance while looking at a physical object, they must get better at processing a stream of visual context and linking it to textual instructions. The M2AD dataset provides a concrete benchmark to drive this research forward, pushing models beyond static image analysis toward dynamic, sequential visual reasoning.

Key Points
  • The 'Manual to Action Dataset' (M2AD) aligns furniture assembly videos with instruction manual pages for step-by-step evaluation.
  • It tests three MLLM capabilities: reducing annotation needs, tracking assembly step progression, and correctly referencing manual pages.
  • Results show current models are limited by an inability to perform robust multi-image and interleaved text-image reasoning.

Why It Matters

The benchmark exposes a key weakness in today's AI assistants, one that holds back practical AR/VR guidance for real-world tasks like repair and assembly.