Research & Papers

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

A new benchmark shows that AI models fail to ask users for help, a key collaborative skill, even when given hints.

Deep Dive

A research team from the University of Trento and University of Montpellier has published a new benchmark called ProactiveBench, designed to measure a critical but overlooked skill in multimodal large language models (MLLMs): proactiveness. The core question is whether AI models can recognize when they need more information and ask a user for simple interventions, much like a human collaborator would. The benchmark was constructed from seven existing datasets repurposed to test scenarios like recognizing occluded objects, enhancing poor-quality images, and interpreting ambiguous sketches.
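To make the evaluation concrete, a minimal sketch of how such a benchmark might score proactiveness: check whether a model's reply requests a user intervention (e.g., moving an occluding object or retaking a blurry photo) instead of guessing an answer. The patterns and function names below are illustrative assumptions, not the paper's actual scoring protocol.

```python
import re

# Hypothetical help-request patterns; a real benchmark would likely use
# a trained classifier or human annotation rather than keyword matching.
HELP_PATTERNS = [
    r"\bcould you\b",
    r"\bcan you\b",
    r"\bplease (move|remove|retake|provide)\b",
    r"\bi need more (information|detail)\b",
]

def is_proactive(reply: str) -> bool:
    """Return True if the reply asks the user for help rather than answering outright."""
    text = reply.lower()
    return any(re.search(p, text) for p in HELP_PATTERNS)

def proactiveness_score(replies: list[str]) -> float:
    """Fraction of replies (on deliberately unanswerable inputs) that request help."""
    if not replies:
        return 0.0
    return sum(is_proactive(r) for r in replies) / len(replies)
```

Under this framing, a model that confidently mislabels an occluded object scores zero, while one that asks the user to reveal it scores as proactive.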

In a comprehensive evaluation of 22 popular MLLMs, the researchers made several key discoveries. First, current models overwhelmingly lack this proactive behavior. Surprisingly, a model's capacity (size and compute) showed no correlation with its proactiveness score. Even more counterintuitively, providing hints or supplying conversation histories via in-context learning often introduced negative bias and hurt performance rather than helping.

The study's most promising finding came from exploring a simple reinforcement learning (RL) fine-tuning strategy. The results demonstrated that proactiveness is a learnable skill; models trained with this method not only improved on the benchmark tasks but also showed signs of generalizing to unseen scenarios. The team has publicly released ProactiveBench, framing it as a foundational step toward building AI assistants that can engage in more natural, collaborative, and effective human-AI teamwork by knowing when to seek clarification.
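One way to picture such an RL setup is a reward signal that favors asking for help exactly when the input cannot be answered as-is. The labels and reward values below are illustrative assumptions, not the paper's actual training configuration.

```python
def proactiveness_reward(input_is_answerable: bool, model_asked_for_help: bool) -> float:
    """Reward the model for requesting help only when the input truly needs it.

    Hypothetical sketch: a real RL fine-tuning pipeline would plug a signal
    like this into a policy-gradient objective over sampled model replies.
    """
    if input_is_answerable and not model_asked_for_help:
        return 1.0   # answered directly when possible: rewarded
    if not input_is_answerable and model_asked_for_help:
        return 1.0   # requested an intervention when needed: rewarded
    return -1.0      # over-asking or guessing blindly: penalized
```

A reward shaped this way penalizes both failure modes the benchmark exposes: never asking for help, and asking indiscriminately.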

Key Points
  • Benchmarked 22 MLLMs across 7 tasks, finding a universal lack of proactive 'help-seeking' behavior.
  • Found no correlation between model size/capacity and proactiveness scores, challenging assumptions about scaling.
  • Simple RL fine-tuning proved proactiveness is a learnable skill that can generalize to new situations.

Why It Matters

For practical AI deployment, assistants that know when to ask questions are safer, more reliable, and truly collaborative.