AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly
Researchers' new multimodal model processes 3D point clouds to predict precise assembly poses for robotics.
A research team led by Zhi Jing has introduced AssemLM, a spatial reasoning multimodal large language model designed specifically for robotic assembly tasks. Unlike standard vision-language models that rely on coarse 2D perception, AssemLM integrates textual instructions, assembly manuals, and 3D point cloud data to perform accurate reasoning over 3D geometry. This allows it to predict the precise 6D poses (position and orientation) required for assembling components, a critical capability for fine-grained manipulation. The model uses a specialized point cloud encoder to capture fine geometric and rotational features, which are then fed into a multimodal language model to support explicit 3D spatial reasoning.
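The pipeline described above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the pooling, fusion, and head weights below are stand-ins we invent to show the data flow from a point cloud and an instruction embedding to a 6D pose (a 3-D translation plus an orientation, here a unit quaternion).

```python
import numpy as np

def normalize_quaternion(q):
    """Project an arbitrary 4-vector onto the unit quaternion sphere."""
    q = np.asarray(q, dtype=float)
    return q / np.linalg.norm(q)

def predict_pose(point_cloud, text_embedding, W_geo, W_txt, W_head):
    """Toy stand-in for an encoder + pose-head pipeline (our assumption,
    not AssemLM's architecture): pool geometric features, fuse them with
    text features, and regress a 7-vector (translation xyz + quaternion)."""
    geo_feat = point_cloud.mean(axis=0) @ W_geo         # crude global pooling
    fused = np.tanh(geo_feat + text_embedding @ W_txt)  # naive additive fusion
    raw = fused @ W_head                                # 7-dim pose head
    translation, quat = raw[:3], normalize_quaternion(raw[3:])
    return translation, quat

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))   # dummy point cloud (N points, xyz)
text = rng.normal(size=16)           # dummy instruction embedding
t, q = predict_pose(cloud, text,
                    W_geo=rng.normal(size=(3, 32)),
                    W_txt=rng.normal(size=(16, 32)),
                    W_head=rng.normal(size=(32, 7)))
print(t.shape, round(float(np.linalg.norm(q)), 6))
```

The real model replaces each of these stand-ins with learned components: a specialized point cloud encoder in place of the mean pooling, and the multimodal language model in place of the additive fusion and linear head.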
To train and evaluate AssemLM, the team constructed AssemBench, a large-scale dataset and benchmark containing over 900,000 multimodal samples with precise 6D pose annotations. The benchmark fills a significant gap in embodied AI by extending evaluation beyond 2D tasks into full 3D geometric inference. Extensive experiments show that AssemLM achieves state-of-the-art performance in 6D pose reasoning across diverse assembly scenarios. Real-robot evaluations confirm its practical utility: the model supports the execution of complex, multi-step assembly operations in real-world settings, a substantial step toward more capable, autonomous robotic systems.
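Scoring 6D pose predictions is conventionally done with two numbers: a translation error (Euclidean distance) and a geodesic rotation error. The sketch below uses these standard metrics as an assumption; the source does not specify which metrics AssemBench reports.

```python
import numpy as np

def pose_errors(t_pred, q_pred, t_gt, q_gt):
    """Standard 6D-pose error metrics (our assumption for illustration):
    translation error as Euclidean distance, rotation error as the geodesic
    angle between unit quaternions, theta = 2 * arccos(|<q_pred, q_gt>|)."""
    trans_err = float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))
    dot = abs(float(np.dot(q_pred, q_gt)))          # |.| handles q vs -q ambiguity
    rot_err_deg = float(np.degrees(2.0 * np.arccos(np.clip(dot, 0.0, 1.0))))
    return trans_err, rot_err_deg

# Identical poses score zero on both metrics.
t_err, r_err = pose_errors([0.1, 0.2, 0.3], [1, 0, 0, 0],
                           [0.1, 0.2, 0.3], [1, 0, 0, 0])
print(t_err, r_err)  # 0.0 0.0
```

Taking the absolute value of the quaternion dot product matters because q and -q encode the same rotation; without it, a correct prediction could be scored as a 180-degree error.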
- AssemLM is a multimodal LLM that processes 3D point clouds, text, and manuals to predict 6D assembly poses.
- The model is trained and evaluated on AssemBench, a new dataset with over 900,000 annotated multimodal samples.
- Real-robot tests show the model enables fine-grained, multi-step assembly execution, advancing beyond 2D-aware models.
Why It Matters
This bridges the gap between AI language reasoning and precise physical manipulation, enabling more autonomous and capable industrial and service robots.