Research & Papers

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

A new benchmark reveals that top mobile AI agents achieve near-zero success rates when faced with real-world app variations.

Deep Dive

A research team led by Yichen Gong has released VenusBench-Mobile, a new online benchmark designed to rigorously test mobile GUI agents, the AI systems that navigate and interact with smartphone screens. Unlike previous app-centric benchmarks, VenusBench-Mobile focuses on user-centric tasks that reflect the diversity and instability of real mobile usage, such as handling dynamic app interfaces and completing multi-step workflows. The benchmark rests on two pillars: what to evaluate, defined through realistic, user-intent-driven tasks, and how to evaluate, via a fine-grained, capability-oriented annotation scheme that diagnoses specific agent failures.
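
To make the idea of capability-oriented annotation concrete, here is a minimal Python sketch of how per-step failure labels might be recorded and aggregated. The capability taxonomy (perception, memory, reasoning, action) and the class names are illustrative assumptions for this article, not the paper's actual schema.

from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

# Illustrative only: the capability taxonomy and record layout are
# assumptions, not VenusBench-Mobile's real annotation format.
class Capability(Enum):
    PERCEPTION = "perception"   # reading the current screen correctly
    MEMORY = "memory"           # recalling earlier steps and context
    REASONING = "reasoning"     # planning the right next action
    ACTION = "action"           # executing the chosen gesture or input

@dataclass
class StepAnnotation:
    step_index: int
    action_taken: str
    success: bool
    failed_capabilities: List[Capability] = field(default_factory=list)

@dataclass
class EpisodeResult:
    task_id: str
    task_success: bool
    steps: List[StepAnnotation]

    def failure_breakdown(self) -> Dict[Capability, int]:
        # Count how often each capability is implicated across failed steps.
        counts = {cap: 0 for cap in Capability}
        for step in self.steps:
            for cap in step.failed_capabilities:
                counts[cap] += 1
        return counts

Aggregating such per-step labels across many episodes is what lets a benchmark attribute overall failures to specific capabilities rather than to the task as a whole.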

Extensive evaluation of leading mobile GUI agents on the benchmark produced stark results. Performance gaps were significantly wider than those observed on prior benchmarks, indicating that VenusBench-Mobile poses substantially harder and more realistic tasks. Diagnostic analysis showed that agent failures are dominated by deficiencies in core capabilities such as perception (accurately 'seeing' the screen) and memory (recalling previous steps), problems often obscured by coarser evaluations. Most critically, even the strongest agents exhibited near-zero success rates under realistic environment variations, such as different app versions or unexpected pop-ups, exposing their brittleness.
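
As an illustration of what testing under environment variations might look like in an evaluation harness, the sketch below parameterizes an episode by app version and injected pop-ups and reports a per-task success rate across those variations. The configuration fields and the run_episode hook are hypothetical, not the benchmark's actual interface.

from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical harness sketch: field names and hooks are assumptions
# for illustration, not VenusBench-Mobile's real API.
@dataclass
class EnvVariation:
    app_version: str                                          # e.g. an older or newer build
    inject_popups: List[str] = field(default_factory=list)    # e.g. ["rate_us_dialog"]
    locale: str = "en-US"

def evaluate_under_variations(
    run_episode: Callable[[str, EnvVariation], bool],
    task_id: str,
    variations: List[EnvVariation],
) -> float:
    # Success rate of one task across all environment variations.
    successes = sum(run_episode(task_id, variation) for variation in variations)
    return successes / len(variations)

# Example: the same task evaluated on two app builds, one with a pop-up injected.
variations = [
    EnvVariation(app_version="8.1.0"),
    EnvVariation(app_version="9.2.3", inject_popups=["rate_us_dialog"]),
]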

The researchers conclude that current mobile GUI agents remain far from reliable for real-world deployment. VenusBench-Mobile's publicly available code and dataset give developers a crucial tool for diagnosing weaknesses and building more robust agents that can handle the messy reality of everyday smartphone use rather than only controlled lab environments.

Key Points
  • Benchmark reveals near-zero success rates for top agents under real-world app variations, exposing critical brittleness.
  • Fine-grained diagnostics show failures are dominated by perception and memory issues, not just task logic.
  • Shifts evaluation from app-centric tasks to user-intent-driven workflows, reflecting actual mobile usage diversity.

Why It Matters

This benchmark sets a new, realistic standard for developing AI that can reliably automate tasks on your phone, moving beyond lab demos.