Flat-Pack Bench reveals AI struggles to assemble furniture step by step
Top vision-language models fail at IKEA-style assembly—new benchmark exposes the gap.
Deep Dive
Researchers introduce Flat-Pack Bench, a fine-grained spatio-temporal benchmark for large vision-language models (LVLMs) built around furniture assembly. It poses multiple-choice questions with visual prompts covering temporal ordering, assembly state localization, part mating, and part tracking. In the reported experiments, state-of-the-art LVLMs struggle significantly, showing weak temporal reasoning and limited understanding of physical interactions, a gap that existing coarse-grained benchmarks do not expose.
Key Points
- Flat-Pack Bench tests fine-grained tasks: temporal ordering, assembly state localization, part mating, and part tracking.
- Top models (GPT-4V, Gemini) perform near chance on temporal ordering and physical contact reasoning.
- Current video benchmarks focus on coarse actions; this benchmark highlights a gap for step-by-step physical reasoning.
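To make "near chance" concrete: on a multiple-choice benchmark, random guessing scores about 1/k on a question with k options, so a model's accuracy only means something relative to that floor. The sketch below is an illustration, not the benchmark's actual evaluation code; the `MCQuestion` record, its field names, and the toy answers are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """One multiple-choice item (hypothetical schema, not the benchmark's)."""
    task: str         # e.g. "temporal_ordering" or "part_mating"
    num_options: int  # number of answer choices offered
    correct: int      # index of the right answer
    predicted: int    # index the model chose


def accuracy_vs_chance(questions):
    """Return {task: (accuracy, chance_level)} for a list of MCQuestion."""
    by_task = {}
    for q in questions:
        by_task.setdefault(q.task, []).append(q)

    report = {}
    for task, qs in by_task.items():
        accuracy = sum(q.predicted == q.correct for q in qs) / len(qs)
        # Expected accuracy of uniform random guessing on these questions.
        chance = sum(1 / q.num_options for q in qs) / len(qs)
        report[task] = (accuracy, chance)
    return report


if __name__ == "__main__":
    # Made-up answers for illustration only.
    toy = [
        MCQuestion("temporal_ordering", 4, correct=0, predicted=0),
        MCQuestion("temporal_ordering", 4, correct=1, predicted=3),
        MCQuestion("temporal_ordering", 4, correct=2, predicted=0),
        MCQuestion("temporal_ordering", 4, correct=3, predicted=1),
    ]
    for task, (acc, chance) in accuracy_vs_chance(toy).items():
        print(f"{task}: accuracy={acc:.2f}, chance={chance:.2f}")
```

Reporting the chance level next to each task's accuracy keeps tasks with different numbers of answer options comparable.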
Why It Matters
If AI can't follow furniture assembly step by step, it is unlikely to handle other multi-step physical tasks such as cooking or construction, which are central to robotics and assistive applications.