Research & Papers

Flat-Pack Bench reveals AI struggles to assemble furniture step by step

Top vision-language models fail at IKEA-style assembly, and a new benchmark exposes the gap.

Deep Dive

Researchers introduce Flat-Pack Bench, a fine-grained spatio-temporal benchmark for large vision-language models (LVLMs) built around furniture assembly tasks. It tests temporal ordering, assembly state localization, part mating, and part tracking through multiple-choice questions with visual prompts. In experiments, state-of-the-art LVLMs struggle significantly, showing poor temporal reasoning and a limited understanding of physical interactions, weaknesses that existing coarse-grained benchmarks do not probe.

Key Points
  • Flat-Pack Bench tests fine-grained tasks: temporal ordering, assembly state localization, part mating, and part tracking.
  • Top models (GPT-4V, Gemini) perform near chance on temporal ordering and physical contact reasoning.
  • Current video benchmarks focus on coarse actions; this benchmark highlights a gap for step-by-step physical reasoning.
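
To make the "near chance" comparison concrete, here is a minimal sketch of scoring multiple-choice answers per task against a random-guess baseline. It assumes each question has one correct option and a known number of options. The `Question` type, the task labels, and the toy data are hypothetical and are not taken from the benchmark or the paper's results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    task: str         # hypothetical task label, e.g. "temporal_ordering"
    num_options: int  # number of answer choices offered
    correct: int      # index of the correct option
    predicted: int    # index the model selected


def accuracy_vs_chance(questions):
    """Return {task: (accuracy, chance_level)} for multiple-choice questions.

    chance_level is the expected accuracy of uniform random guessing,
    averaged over the questions in that task.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    chance_sum = defaultdict(float)
    for q in questions:
        totals[q.task] += 1
        hits[q.task] += int(q.predicted == q.correct)
        chance_sum[q.task] += 1.0 / q.num_options
    return {t: (hits[t] / totals[t], chance_sum[t] / totals[t]) for t in totals}


# Toy data for illustration only; these are not results from the paper.
toy = [
    Question("temporal_ordering", 4, 0, 0),
    Question("temporal_ordering", 4, 1, 2),
    Question("temporal_ordering", 4, 3, 1),
    Question("temporal_ordering", 4, 2, 0),
    Question("part_mating", 2, 1, 1),
    Question("part_mating", 2, 0, 0),
]
report = accuracy_vs_chance(toy)
for task, (acc, chance) in report.items():
    print(f"{task}: accuracy={acc:.2f}, chance={chance:.2f}")
```

A model is "near chance" on a task when its accuracy sits close to the second number, which is why the number of answer options matters when reading such results.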

Why It Matters

If AI can't follow furniture assembly step by step, it is unlikely to handle real-world tasks like cooking or construction, both critical for robotics and assistance.