Research & Papers

Video-Based Reward Modeling for Computer-Use Agents

A new 8B-parameter model scores 84.7% accuracy by watching screen recordings, outperforming proprietary giants.

Deep Dive

A research team led by Linxin Song from Stanford University has introduced a novel method for evaluating AI agents that operate computers. Their system, called the Execution Video Reward Model (ExeVRM), judges an agent's performance not by analyzing its internal code or reasoning, but simply by watching a video recording of the screen during task execution. This video-based approach is fundamentally model-agnostic, meaning it can evaluate agents built on any underlying architecture, from open-source models to proprietary systems like GPT-5.2.
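The model-agnostic idea can be pictured as an evaluator whose only inputs are the task instruction and the screen recording. The sketch below is purely illustrative — the class and method names (`Episode`, `VideoRewardModel.score`) are assumptions, not the paper's API — but it shows why the approach works for any agent: nothing about the agent's internals appears in the interface.

```python
# Hypothetical interface for a video-based reward model. The evaluator sees
# only the instruction and the recorded frames, never the agent's code or
# chain of thought, so any agent architecture can be judged.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    instruction: str     # what the user asked the agent to do
    frames: List[bytes]  # screen-recording frames (e.g., encoded images)

class VideoRewardModel:
    def score(self, episode: Episode) -> float:
        """Return an estimated probability that the task succeeded,
        based only on the visual timeline."""
        # A real model would jointly encode frames and instruction;
        # this stub exists only to illustrate the interface.
        raise NotImplementedError
```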

To train their 8-billion-parameter model, the team created ExeVR-53k, a dataset of 53,000 high-quality examples pairing a user instruction, a screen-recording video, and a human judgment of success. A key innovation is 'adversarial instruction translation,' which synthetically creates tricky negative examples to improve the model's discernment. They also developed 'spatiotemporal token pruning' to efficiently process long, high-resolution videos by focusing computational power on the moments where the user interface actually changes.
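The intuition behind pruning static stretches of a recording can be sketched with a simple frame-difference filter. This is a loose analogy, not the paper's method: real spatiotemporal token pruning operates on vision tokens inside the model, and the diff metric and threshold here are illustrative assumptions.

```python
# Minimal sketch of the pruning intuition: keep only frames where the UI
# visibly changes, so compute is spent on the informative moments.
# Each "frame" is a flat list of pixel values for simplicity.
from typing import List

def prune_static_frames(frames: List[List[int]], threshold: float = 0.01) -> List[int]:
    """Return indices of frames to keep: the first frame, plus any frame
    whose mean absolute pixel change from the last kept frame exceeds
    `threshold` (as a fraction of the max pixel value, 255)."""
    if not frames:
        return []
    keep = [0]
    for i in range(1, len(frames)):
        prev = frames[keep[-1]]
        diff = sum(abs(a - b) for a, b in zip(prev, frames[i])) / (len(prev) * 255)
        if diff > threshold:
            keep.append(i)
    return keep
```

Runs of identical frames (an idle desktop, a loading spinner region unchanged) collapse to a single kept frame, which is what makes long recordings tractable.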

The results are striking. ExeVRM 8B achieved 84.7% accuracy and 87.7% recall in assessing whether a computer-using agent successfully completed a task, outperforming much larger proprietary models from OpenAI and Google. It proved effective across four major operating systems: Ubuntu, macOS, Windows, and Android. Crucially, because it analyzes the visual timeline, it can provide 'temporal attribution,' pinpointing exactly which step in a multi-step process caused a failure.
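Temporal attribution amounts to localizing where in the step sequence things went wrong. A hedged sketch, assuming the judge can emit a per-step success estimate (that scoring function itself is not shown, and the drop threshold is an illustrative choice, not from the paper):

```python
# Given per-step success estimates from a video judge, flag the first step
# where the estimate collapses relative to the best score seen so far --
# a simple stand-in for pinpointing the failing step in a multi-step task.
from typing import List, Optional

def first_failing_step(step_scores: List[float], drop: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose score falls by more than
    `drop` below the running maximum, or None if no such step exists."""
    best = 0.0
    for i, score in enumerate(step_scores):
        if best - score > drop:
            return i
        best = max(best, score)
    return None
```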

This work addresses a critical bottleneck in AI agent development: scalable, reliable evaluation. As agents become more complex, manually checking if they correctly booked a flight or filled out a spreadsheet is unsustainable. ExeVRM offers an automated, objective, and general-purpose solution, potentially accelerating the development of more trustworthy and capable digital assistants.

Key Points
  • ExeVRM 8B model scores 84.7% accuracy evaluating agents via video, beating GPT-5.2 and Gemini 3 Pro.
  • Trained on ExeVR-53k, a new dataset of 53k video-task-reward examples with synthetic negative samples.
  • Uses 'spatiotemporal token pruning' to efficiently analyze long screen recordings across Ubuntu, macOS, Windows, and Android.

Why It Matters

Provides a scalable, objective way to evaluate AI assistants, crucial for developing reliable agents that can safely automate complex computer tasks.