Research & Papers

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

New benchmark shows even GPT-5 scores only 7.39% on proactive tasks, highlighting a major AI frontier.

Deep Dive

A research team led by Dezhi Kong and 14 other authors has published ProactiveMobile, a groundbreaking benchmark designed to measure and advance 'proactive intelligence' in mobile AI agents. Unlike current multimodal large language models (MLLMs) that operate reactively—merely executing explicit commands—proactive agents autonomously anticipate user needs and initiate actions. The benchmark formalizes this challenge by requiring agents to infer latent user intent from on-device contextual signals and generate executable function sequences from a pool of 63 real-world APIs. This addresses a critical bottleneck in AI development, as the lack of rigorous, executable evaluation has stalled progress toward truly autonomous digital assistants.
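To make the evaluation setup concrete, here is a minimal sketch of how a benchmark of this kind could score an agent: the agent reads on-device context, emits a sequence of API calls, and the instance counts as solved if that sequence matches any of the annotated acceptable answers. All class and function names here are hypothetical illustrations, as is the exact-match scoring rule; the paper's actual protocol may differ.

```python
# Hypothetical sketch of scoring a predicted API-call sequence against
# multi-answer annotations, in the spirit of a proactive-agent benchmark.
# Names (Call, Instance, is_success, success_rate) are illustrative only.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Call:
    """One executable device-API invocation, e.g. set_alarm(time='05:00')."""
    api: str
    args: tuple  # sorted (key, value) pairs so comparison ignores arg order

@dataclass
class Instance:
    context: str                                  # on-device contextual signals
    answers: list = field(default_factory=list)   # accepted call sequences

def is_success(predicted, instance):
    """Count an instance as solved if the predicted call sequence exactly
    matches any one of the annotated acceptable sequences."""
    return any(predicted == ans for ans in instance.answers)

def success_rate(predictions, instances):
    """Percentage of instances solved by the paired predictions."""
    solved = sum(is_success(p, i) for p, i in zip(predictions, instances))
    return 100.0 * solved / len(instances)

# Example: the context hints at an early flight; two alarm times are accepted.
inst = Instance(
    context="calendar shows a 7:00 AM flight tomorrow",
    answers=[
        [Call("set_alarm", (("time", "05:00"),))],
        [Call("set_alarm", (("time", "04:30"),))],
    ],
)
pred = [Call("set_alarm", (("time", "05:00"),))]
print(success_rate([pred], [inst]))  # → 100.0
```

Multi-answer annotation matters here because a proactive task rarely has a single correct action; matching against a set of expert-approved sequences avoids penalizing reasonable alternatives.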

The ProactiveMobile benchmark features 3,660 meticulously curated instances across 14 complex scenarios, with multi-answer annotations and a final audit by 30 experts to ensure quality. In extensive experiments, a fine-tuned version of Alibaba's Qwen2.5-VL-7B-Instruct model achieved a 19.15% success rate, significantly outperforming OpenAI's advanced models o1 (15.71%) and GPT-5 (7.39%). This stark performance gap suggests that proactivity is a distinct, learnable competency that today's state-of-the-art models widely lack. The benchmark's release provides the toolset needed to systematically train and evaluate the next generation of AI agents, ones that move beyond simple command execution to genuine anticipation and autonomous action on mobile devices.

Key Points
  • Benchmark includes 3,660 test instances across 14 real-world scenarios and a pool of 63 executable device APIs.
  • A fine-tuned Qwen2.5-VL-7B model scored 19.15%, beating OpenAI's o1 (15.71%) and GPT-5 (7.39%) by a wide margin.
  • Results show 'proactive intelligence' is a critical, learnable skill that current top-tier MLLMs fundamentally lack.

Why It Matters

This benchmark defines the next frontier for AI assistants: moving from reactive tools to agents that truly anticipate our needs.