ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
New benchmark shows that AI models fail to request user help, a key collaborative skill, even when given hints.
A research team from the University of Trento and University of Montpellier has published a new benchmark called ProactiveBench, designed to measure a critical but overlooked skill in multimodal large language models (MLLMs): proactiveness. The core question is whether AI models can recognize when they need more information and ask a user for simple interventions, much like a human collaborator would. The benchmark was constructed from seven existing datasets repurposed to test scenarios like recognizing occluded objects, enhancing poor-quality images, and interpreting ambiguous sketches.
In a comprehensive evaluation of 22 popular MLLMs, the researchers made several key discoveries. First, current models overwhelmingly lack this proactive behavior. Surprisingly, a model's capacity (size and compute) showed no correlation with its proactiveness score. Even more counterintuitively, providing hints or supplying conversation histories for in-context learning often introduced negative bias and hurt performance rather than helping.
The study's most promising finding came from exploring a simple reinforcement learning (RL) fine-tuning strategy. The results demonstrated that proactiveness is a learnable skill; models trained with this method not only improved on the benchmark tasks but also showed signs of generalizing to unseen scenarios. The team has publicly released ProactiveBench, framing it as a foundational step toward building AI assistants that can engage in more natural, collaborative, and effective human-AI teamwork by knowing when to seek clarification.
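The article does not describe the RL fine-tuning setup in detail, but the core idea can be sketched as a reward signal that pays the model for asking exactly when an intervention is needed and penalizes it for guessing on ambiguous inputs or asking unnecessarily. The function names (`proactiveness_reward`, `update_score`) and the scalar "ask propensity" below are hypothetical illustrations, not the paper's implementation:

```python
def proactiveness_reward(needs_help: bool, model_action: str) -> float:
    """Scalar reward for proactiveness fine-tuning (hypothetical sketch).

    +1.0 when the model asks for help exactly when help is needed,
    -1.0 when it guesses on an ambiguous input or asks unnecessarily.
    """
    asked = model_action == "ask"
    return 1.0 if asked == needs_help else -1.0


def update_score(ask_propensity: float, reward: float, lr: float = 0.1) -> float:
    """Toy REINFORCE-style update: nudge a scalar 'ask propensity'
    up or down in proportion to the received reward."""
    return ask_propensity + lr * reward


# Toy rollout: an occluded image needs a user intervention; the model asks.
reward = proactiveness_reward(needs_help=True, model_action="ask")
propensity = update_score(0.0, reward)  # propensity rises toward asking
```

In a real setup the action would come from sampling the model's own output and the update would flow through a policy-gradient or preference-optimization objective, but the shape of the incentive is the same: asking is rewarded only when the input genuinely lacks the information needed to answer.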
- Benchmarked 22 MLLMs across 7 tasks, finding a universal lack of proactive 'help-seeking' behavior.
- Found no correlation between model size/capacity and proactiveness scores, challenging assumptions about scaling.
- Simple RL fine-tuning proved proactiveness is a learnable skill that can generalize to new situations.
Why It Matters
For practical AI deployment, assistants that know when to ask questions are safer, more reliable, and truly collaborative.