ReVision cuts visual token usage by 46% for computer-use AI agents
New technique slashes token costs by 46% while boosting agent performance
Computer-use agents (CUAs) that navigate graphical user interfaces rely on processing screenshots, each generating a massive number of visual tokens. As interaction trajectories lengthen, token costs skyrocket, forcing models to drop historical context, which in turn leads to performance plateaus. A team led by Amirhossein Abaskohi (with co-authors from multiple institutions) introduces ReVision, a training method in which the model learns to select and remove redundant visual patches between consecutive screenshots while preserving spatial structure. By comparing patch representations across frames, the model drops temporally redundant information without losing critical task cues.
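The idea of dropping patches that did not change between two consecutive screenshots can be illustrated with a minimal sketch. This is not ReVision itself: the source describes a learned selection over patch representations, whereas the code below swaps in a fixed threshold on raw pixel differences. The function names, the patch size of 14, and the tolerance value are all assumptions made for illustration.

```python
import numpy as np


def split_into_patches(frame: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) frame into a (rows, cols, patch*patch*C) grid."""
    h, w, c = frame.shape
    rows, cols = h // patch, w // patch
    # Crop to a whole number of patches, then separate the patch axes.
    grid = frame[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4).reshape(rows, cols, -1)
    # Cast to float so that uint8 subtraction cannot wrap around.
    return grid.astype(np.float32)


def changed_patch_mask(prev: np.ndarray, curr: np.ndarray,
                       patch: int = 14, tol: float = 1.0) -> np.ndarray:
    """Return a (rows, cols) boolean grid: True where `curr` differs from `prev`.

    Hypothetical stand-in for a learned selector: a patch counts as
    redundant when its mean absolute pixel difference is at most `tol`.
    """
    a = split_into_patches(prev, patch)
    b = split_into_patches(curr, patch)
    return np.abs(a - b).mean(axis=-1) > tol


# Two synthetic 56x56 "screenshots" that differ in exactly one 14x14 patch.
prev = np.zeros((56, 56, 3), dtype=np.uint8)
curr = prev.copy()
curr[0:14, 14:28] = 255

keep = changed_patch_mask(prev, curr)
# Kept patches retain their (row, col) grid coordinates, so the spatial
# layout of the screenshot is still known after redundant ones are dropped.
kept_positions = np.argwhere(keep)
dropped_fraction = 1.0 - keep.mean()
```

In this toy case only the patch at grid position (0, 1) is kept and the other 15 of 16 are dropped. Real screenshots would need a more robust comparison than raw pixels, which is presumably why the method compares learned patch representations instead.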
Tested on OSWorld, WebTailBench, and AgentNetBench using Qwen2.5-VL-7B with five history screenshots, ReVision reduces token usage by 46% on average and boosts success rate by 3% over the no-drop baseline. More importantly, the authors report that a longer history of 20 or more screenshots continues to improve performance once redundancy is removed, challenging the common belief that the benefit of visual history saturates. This suggests CUAs can scale to longer, more complex tasks without a matching explosion in compute and cost.
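To get a feel for what a 46% reduction means for history length, here is a back-of-the-envelope sketch. It assumes the 46% average applies uniformly to every screenshot, which the source does not claim (the figure was measured with five history screenshots), and the token budget and per-screenshot token count are hypothetical numbers chosen only for the arithmetic.

```python
import math


def screenshots_within_budget(token_budget: int,
                              tokens_per_screenshot: int,
                              reduction: float = 0.0) -> int:
    """How many screenshots fit in a visual-token budget.

    `reduction` is the fraction of visual tokens removed per screenshot
    (0.46 for the reported average, assumed uniform here).
    """
    effective_cost = tokens_per_screenshot * (1.0 - reduction)
    return math.floor(token_budget / effective_cost)


# Hypothetical numbers: 10,000 visual tokens of budget, 1,000 per screenshot.
baseline = screenshots_within_budget(10_000, 1_000)
with_drop = screenshots_within_budget(10_000, 1_000, reduction=0.46)
```

Under these assumptions the same budget holds 10 screenshots without dropping and 18 with it, since each screenshot costs about 54% of its original tokens (roughly 1.85 times as much history for the same cost).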
- ReVision reduces visual token usage by ~46% on average across three benchmarks with Qwen2.5-VL-7B
- Success rate improves by 3% over no-drop baseline while using far fewer tokens
- Enables agents to benefit from longer history (5+ screenshots) without performance saturation
Why It Matters
Unlocks cost-efficient AI agents that leverage rich visual history, critical for automating complex desktop and web workflows.