Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
A new training pipeline tackles AI's overconfidence and poor localization on web tasks.
A research team from Aalto University and Silo AI has published a paper detailing how they significantly improved the web interaction capabilities of Qwen2.5-VL-32B, a leading open-source vision-language model (VLM). Their work addresses a critical gap: deploying VLMs as autonomous agents that can reason and act on a computer screen from visual input alone, a key step toward practical AI automation. The researchers identified three core failures in the base model: inaccurate localization of page elements and the cursor, sensitivity to how instructions are phrased, and an "overoptimistic bias" where the AI assumes its actions succeeded without checking the result.
To solve these issues, the team developed a specialized two-stage fine-tuning pipeline focused on a fundamental task: moving a mouse cursor and clicking a described element. The first stage trains the model to accurately determine if the cursor is already over the target. The second stage trains it to execute just one command, a move or a click, and then critically analyze the resulting screen state before planning the next action. This step-by-step, verification-heavy approach directly counters the model's tendency to hallucinate success. Evaluated on a custom benchmark, their tuned model increased task success rates from 86% to 94% in the most challenging setting. The paper has been accepted to the ACM Web Conference 2026.
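The paper itself does not ship code, but the loop it describes maps naturally onto a simple control flow. The sketch below is a minimal, assumed illustration of that pattern in Python: the `Env` wrapper and `ask_vlm` callable are hypothetical placeholders for the screen-capture, mouse, and model-inference plumbing, not artifacts from the paper. Each step checks whether the cursor is already on the target (stage one), issues exactly one move or click (stage two), and then re-inspects the screen before declaring success.

```python
# Minimal sketch (assumed, not from the paper) of an act-then-verify interaction loop.
# The model sees only a screenshot, issues one move or click per step, and must
# re-inspect the screen afterwards instead of assuming the action succeeded.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Env:
    """Placeholder environment: wire these callables to a real browser/VM in practice."""
    capture_screen: Callable[[], bytes]       # returns a screenshot (e.g. PNG bytes)
    move_cursor: Callable[[int, int], None]   # moves the mouse to pixel (x, y)
    click: Callable[[], None]                 # clicks at the current cursor position


def run_single_click_task(env: Env,
                          ask_vlm: Callable[[bytes, str], str],
                          instruction: str,
                          max_steps: int = 10) -> bool:
    """Click the element described by `instruction`, verifying after every action."""
    for _ in range(max_steps):
        screen = env.capture_screen()

        # Stage-1 skill: decide whether the cursor is already over the target.
        on_target = ask_vlm(
            screen, f"Is the cursor over the element described as '{instruction}'? yes/no"
        ).strip().lower().startswith("yes")

        if on_target:
            env.click()                       # Stage-2 skill: exactly one action per step
        else:
            reply = ask_vlm(
                screen, f"Give target pixel coordinates as 'x,y' for: {instruction}"
            )
            x, y = (int(v) for v in reply.split(","))
            env.move_cursor(x, y)             # Stage-2 skill: exactly one action per step

        # Verification: re-observe the screen rather than assuming success.
        after = env.capture_screen()
        if ask_vlm(
            after, f"Was '{instruction}' clicked successfully? yes/no"
        ).strip().lower().startswith("yes"):
            return True
    return False
```

The key design choice mirrored here is that the loop never chains actions blindly: every move or click is followed by a fresh observation, which is what the authors use to counter the model's overoptimistic bias.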
- Fine-tuned Qwen2.5-VL-32B model achieves a 94% success rate on single-click web tasks, an 8-point jump from 86%.
- Two-stage training pipeline combats AI's "overoptimistic bias" by forcing verification after each mouse move or click.
- Research tackles core VLM weaknesses that hold back web agents: poor visual localization and sensitivity to instruction phrasing.
Why It Matters
This work is a concrete step toward reliable AI agents that can automate tedious digital tasks by seeing and acting like a human.