Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
New method processes only high-resolution image crops needed for queries, slashing compute costs.
A team of researchers from Tel Aviv University, NVIDIA, and other institutions has introduced a novel framework called AwaRes (Awaiting Resolution) that fundamentally changes how Vision-Language Models (VLMs) process images. Traditional VLMs are forced into a difficult trade-off: use high-resolution inputs for accuracy (like reading tiny text) and incur massive computational costs, or use low-resolution inputs for efficiency and risk missing critical details. AwaRes sidesteps this by operating on a low-resolution "global" view of an image. When a user asks a question, the model can call a tool to request high-resolution crops of only the specific regions it deems necessary to answer, such as a license plate or a product label.
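The mechanics can be sketched as a cheap downsampled global view plus a tool that returns full-resolution crops on demand. This is a minimal illustration, not the paper's implementation; the function names, the stride-based downsampling, and the box coordinates are all assumptions.

```python
import numpy as np

# Illustrative sketch of on-demand crop retrieval (names are hypothetical,
# not from the AwaRes paper).

def get_global_view(image: np.ndarray, stride: int = 8) -> np.ndarray:
    """Downsample the full image into a cheap low-resolution 'global' view."""
    return image[::stride, ::stride]

def crop_tool(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Tool call: return a full-resolution crop (x0, y0, x1, y1) on demand."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

# A large 4096x4096 page; the model initially sees only a 512x512 view.
page = np.zeros((4096, 4096, 3), dtype=np.uint8)
global_view = get_global_view(page)                 # shape (512, 512, 3)
plate = crop_tool(page, (1000, 2000, 1256, 2128))   # shape (128, 256, 3)
print(global_view.shape, plate.shape)
```

Only the small crop is ever processed at full resolution; the rest of the image stays at the cheap global scale.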
The researchers created training data automatically using a clever two-step process. First, a "judge" model compares answers from low and high-resolution versions of an image to determine if cropping is needed. Then, an "oracle" grounding model pinpoints the exact evidence for the correct answer, which is mapped to a set of discrete crops. This data was used to train AwaRes through supervised fine-tuning followed by a reinforcement learning technique called GRPO (Group Relative Policy Optimization). The model is rewarded for both answer correctness and for minimizing the number and size of costly high-resolution crops it requests.
This on-demand, spatial retrieval approach promises to make high-performance VLMs far more practical. By avoiding the need to process entire gigapixel images at full resolution, the method can lead to order-of-magnitude reductions in inference time and cost, enabling more complex visual reasoning in real-world applications like document analysis, medical imaging, and autonomous systems.
- Introduces AwaRes, a framework that uses tool-calling to retrieve only necessary high-resolution image crops, avoiding full-image processing.
- Trains models using automatically generated data from a "judge" and an "oracle", optimized via GRPO with a reward for correctness and crop-cost penalties.
- Aims to resolve the core VLM trade-off, enabling efficient, accurate analysis of fine details (e.g., small text) without prohibitive compute costs.
Why It Matters
Dramatically lowers the cost and latency of running powerful VLMs, making detailed visual analysis feasible for widespread commercial and research use.