Research & Papers

OmniGUI: Benchmark Tests AI Agents on Multimodal Smartphone Tasks

First step-level benchmark combining static images, audio, and video for GUI agents

Deep Dive

Current GUI agent benchmarks rely on static screenshots, but real smartphone interactions demand processing transient audio cues and temporal video dynamics. To fill this gap, researchers from multiple institutions created OmniGUI, the first step-level benchmark designed for omni-modal smartphone environments. The dataset provides continuous, interleaved multimodal inputs (static images, synchronous audio, and video clips) at every action step. It contains 709 expert-demonstrated episodes (2,579 action steps, or about 3.6 steps per episode on average) across 29 applications, and each step is annotated with its level of multimodal dependency.

Initial baselines using foundational omni-modal models reveal significant degradation in action prediction performance when tasks require synchronous temporal and auditory signals. Ablation studies isolated cross-modal interference as a key bottleneck, particularly when processing task-irrelevant environmental noise. The full dataset, evaluation pipeline, and baseline prompts are open-sourced to accelerate research into multimodal GUI agents.
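
To make the step-level evaluation concrete, here is a minimal sketch of how action-prediction accuracy might be broken down by multimodal dependency level. It is not the released evaluation pipeline: the `Step` record, its field names, the dependency labels ("static", "audio", "video"), the exact-match scoring rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One action step. Every field name here is hypothetical."""
    episode_id: str
    screenshot: str          # path to the static image for this step
    audio: Optional[str]     # path to the audio clip, if the step has one
    video: Optional[str]     # path to the video clip, if the step has one
    dependency: str          # assumed labels: "static", "audio", "video"
    gold_action: str         # expert-demonstrated action
    predicted_action: str    # model output for this step


def accuracy_by_dependency(steps):
    """Return exact-match step accuracy for each dependency label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in steps:
        totals[s.dependency] += 1
        hits[s.dependency] += int(s.predicted_action == s.gold_action)
    return {level: hits[level] / totals[level] for level in totals}


# Toy data for illustration only; not drawn from OmniGUI.
steps = [
    Step("ep1", "a.png", None, None, "static", "tap(1)", "tap(1)"),
    Step("ep1", "b.png", "b.wav", None, "audio", "tap(2)", "tap(3)"),
    Step("ep2", "c.png", "c.wav", None, "audio", "swipe(up)", "swipe(up)"),
    Step("ep2", "d.png", None, "d.mp4", "video", "tap(4)", "tap(4)"),
]
scores = accuracy_by_dependency(steps)
print(scores)
```

Grouping scores this way is what lets a gap between static steps and audio- or video-dependent steps show up, rather than being averaged away in a single overall number.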

Key Points
  • 709 expert-demonstrated episodes with 2,579 action steps across 29 smartphone apps
  • First benchmark to include synchronous audio, video clips, and static images at each action step
  • Current models show a 30–50% accuracy drop on tasks requiring temporal or auditory signals compared with static tasks

Why It Matters

OmniGUI sets a new standard for evaluating AI agents that must handle real-world smartphone interactions involving images, audio, and video together.