New 'Re-Prefill' method boosts GUI grounding in VLMs by 4.3%
Prefill stage, not decoding, determines GUI grounding accuracy in VLMs.
A new study identifies a critical bottleneck in how Vision-Language Models (VLMs) perform GUI grounding, the task of locating UI elements from natural language instructions. The researchers found that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while decoding refines the final coordinates. This asymmetry makes prefill the decisive step, because errors in candidate selection during prefill cannot be corrected later. The finding challenges existing training-free approaches that rely on multiple inference runs, such as iterative cropping or candidate aggregation, and overlook the foundational role of prefill.
To address this, the authors propose Re-Prefill, a training-free method that introduces an attention-guided second prefill stage. During inference, visual tokens receiving consistently high attention from the query's final token across layers are extracted as a preliminary target hypothesis. These tokens are appended to the input along with instruction hidden states, enabling the model to re-think its decision before generating coordinates. Experiments across four VLMs and five benchmarks—ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI—show consistent improvements without any additional training, with gains of up to 4.3% on ScreenSpot-Pro.
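The token-selection step described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_consistent_tokens`, the per-layer top-k voting rule, and the `top_k` and `min_layer_frac` parameters are all assumptions, since the summary does not say how "consistently high attention" is measured. The sketch also assumes the attention weights from the query's final token to each visual token have already been extracted for every layer (for example, averaged over heads).

```python
import numpy as np


def select_consistent_tokens(attn, top_k=4, min_layer_frac=0.5):
    """Pick visual tokens that rank highly in many layers (hypothetical rule).

    attn: array of shape (num_layers, num_visual_tokens) holding the attention
          weight from the query's final token to each visual token.
    Returns the sorted indices of tokens that appear in the per-layer top-k
    in at least `min_layer_frac` of the layers.
    """
    attn = np.asarray(attn, dtype=float)
    num_layers, num_tokens = attn.shape
    k = min(top_k, num_tokens)

    # Indices of the k highest-attention tokens in each layer (order within k is arbitrary).
    top_idx = np.argpartition(-attn, k - 1, axis=1)[:, :k]

    # Count how many layers voted for each token.
    votes = np.zeros(num_tokens, dtype=int)
    np.add.at(votes, top_idx.ravel(), 1)

    # Keep tokens supported by enough layers.
    return np.flatnonzero(votes >= min_layer_frac * num_layers)


# Toy example: 3 layers, 6 visual tokens (values are made up for illustration).
toy_attn = [
    [0.9, 0.0, 0.8, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.0, 0.2],
    [0.1, 0.0, 0.7, 0.9, 0.2, 0.0],
]
selected = select_consistent_tokens(toy_attn, top_k=2, min_layer_frac=0.6)
print(selected)  # tokens 2 and 3 are in the top-2 of at least two layers
```

In a Re-Prefill-style pipeline, the selected indices would identify the visual tokens to append to the input for the second prefill pass; how they are combined with the instruction hidden states is not specified in this summary and is left out of the sketch.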
- GUI grounding in VLMs has a previously overlooked asymmetry: prefill selects candidates while decoding only refines coordinates, making prefill errors uncorrectable.
- Re-Prefill adds an attention-guided second prefill stage without retraining, using high-attention visual tokens to refine target hypotheses before coordinate generation.
- Re-Prefill achieves gains of up to 4.3% on ScreenSpot-Pro, with consistent improvements across four VLMs and five benchmarks.
Why It Matters
Re-Prefill boosts GUI agent accuracy without retraining, making VLMs more reliable for automating interface interactions.