Flat-Pack Bench reveals AI struggles to assemble furniture step by step

arXiv cs.CV May 22, 2026

⚡ Top vision-language models fail at IKEA-style assembly, and a new benchmark exposes the gap.

Deep Dive

Researchers introduce Flat-Pack Bench, a fine-grained spatio-temporal benchmark for large vision-language models (LVLMs) built on furniture assembly tasks. It tests temporal ordering, assembly state localization, part mating, and part tracking via multiple-choice questions with visual prompts. Experiments show that state-of-the-art LVLMs struggle significantly, revealing poor temporal reasoning and a limited understanding of physical interactions, weaknesses that existing coarse-grained benchmarks do not surface.

Key Points

Flat-Pack Bench tests fine-grained tasks: temporal ordering, assembly state localization, part mating, and part tracking.
Top models (GPT-4V, Gemini) perform near chance on temporal ordering and physical contact reasoning.
Current video benchmarks focus on coarse actions; this benchmark highlights a gap for step-by-step physical reasoning.
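
To make "near chance" concrete, here is a minimal sketch of how per-task accuracy on multiple-choice questions can be compared against a random-guess baseline. It assumes four answer options per question, which the summary does not state. The records below are made-up illustrations, not results from the benchmark, and the function names are hypothetical.

```python
from collections import defaultdict


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def accuracy_by_task(records):
    """Compute accuracy per task from (task, predicted, answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, answer in records:
        total[task] += 1
        if predicted == answer:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical records: (task, model's choice, correct choice).
# These are illustrative only and are NOT data from Flat-Pack Bench.
records = [
    ("temporal_ordering", "A", "B"),
    ("temporal_ordering", "C", "C"),
    ("temporal_ordering", "D", "A"),
    ("temporal_ordering", "B", "D"),
    ("part_tracking", "A", "A"),
    ("part_tracking", "B", "B"),
    ("part_tracking", "C", "D"),
    ("part_tracking", "D", "D"),
]

scores = accuracy_by_task(records)
baseline = chance_accuracy(4)  # assumption: four options per question

for task, acc in scores.items():
    print(f"{task}: accuracy={acc:.2f}, margin over chance={acc - baseline:+.2f}")
```

A margin close to zero means the model does no better than guessing on that task, which is the sense in which a model performs "near chance".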

Why It Matters

If AI can't follow furniture assembly step by step, it is unlikely to handle real-world procedural tasks like cooking or construction, a capability that is critical for robotics and assistive applications.
