FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
New dataset shows fine-tuning a 3B model yields up to a 90.8% relative accuracy improvement on held-out factory tasks.
A research team led by Xiangru Jian and Hao Xu has introduced FORGE, a new benchmark designed to rigorously evaluate Multimodal Large Language Models (MLLMs) in real-world manufacturing settings. The core of FORGE is a high-quality dataset that combines 2D images with 3D point clouds, annotated with precise, domain-specific details such as exact part model numbers, a level of granularity missing from previous datasets. The team used this resource to test 18 leading MLLMs on three industrial tasks: workpiece verification, structural surface inspection, and assembly verification. The results revealed significant performance gaps. A bottleneck analysis produced a surprising finding: a lack of domain-specific knowledge, not visual grounding, is the primary limitation holding these models back in factory environments.
Beyond evaluation, FORGE provides a pathway for improvement, because its structured annotations can serve as actionable training data. As a proof of concept, the researchers performed supervised fine-tuning on a compact 3B-parameter model using the FORGE dataset. This yielded a relative improvement of up to 90.8% in accuracy on held-out manufacturing scenarios. The result suggests that targeted, domain-adapted training can substantially narrow the performance gap, offering a practical blueprint for developing specialized AI for industrial automation. The team has made the code and dataset publicly available, aiming to steer future research toward the identified knowledge bottleneck and to accelerate the deployment of reliable AI in manufacturing.
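A note on reading the headline number: a relative improvement is measured against the baseline accuracy, so it is not the same as a gain in percentage points. The short Python sketch below illustrates the arithmetic. The baseline and fine-tuned accuracies in it are hypothetical values chosen only to reproduce a 90.8% relative gain; they are not figures reported for FORGE, and the function names are illustrative.

```python
def relative_improvement(baseline_acc: float, tuned_acc: float) -> float:
    """Gain expressed as a fraction of the baseline accuracy."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (tuned_acc - baseline_acc) / baseline_acc


def absolute_improvement(baseline_acc: float, tuned_acc: float) -> float:
    """Gain expressed as a raw difference in accuracy (fraction of 1.0)."""
    return tuned_acc - baseline_acc


if __name__ == "__main__":
    # Hypothetical accuracies, NOT taken from the FORGE results.
    baseline = 0.40   # assumed accuracy before fine-tuning
    tuned = 0.7632    # assumed accuracy after fine-tuning

    rel = relative_improvement(baseline, tuned)
    abs_gain = absolute_improvement(baseline, tuned)

    print(f"relative improvement: {rel:.1%}")
    print(f"absolute improvement: {abs_gain * 100:.2f} percentage points")
```

Under these assumed values, the same result can be described either as a 90.8% relative improvement or as a gain of roughly 36 percentage points, which is why the "relative" qualifier matters when comparing results across models with different starting accuracies.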
- The FORGE benchmark tests 18 MLLMs on 3 manufacturing tasks (workpiece verification, structural surface inspection, and assembly verification) using combined 2D/3D data.
- Key finding: Domain-specific knowledge, not visual grounding, is the primary bottleneck for AI in manufacturing, challenging conventional wisdom.
- Fine-tuning a 3B-parameter model on FORGE data achieved up to a 90.8% relative accuracy improvement on held-out scenarios, supporting its value as a training resource.
Why It Matters
FORGE provides a clear roadmap and dataset for building reliable, specialized AI that can perform complex inspections and verifications on the factory floor.