NVIDIA's LocateAnything detects objects in photos, UIs, and text at high speed

GIGAZINE / @NVIDIAAI May 29, 2026

New open-source model beats Qwen3-VL and Rex-Omni on UI and fine-grained detection

Deep Dive

NVIDIA has unveiled LocateAnything, a high-speed vision-language grounding model for object detection across photos, application screenshots, and text documents. The model uses a novel parallel box decoding mechanism that enables extremely fast bounding box prediction; a demonstration video shows near-instant detection. LocateAnything is open source, with a 3B-parameter version available on Hugging Face, and it will be presented at CVPR 2026.

In benchmark comparisons, LocateAnything significantly outperforms Qwen3-VL and Rex-Omni on fine-grained tasks such as recognizing each window in a building individually, identifying separate pieces of wood, and recognizing characters with high accuracy. Its training data includes UI layouts and document structures as well as natural images, which gives it an edge in detecting application UI elements such as menu items. A live demo on Hugging Face Spaces lets users upload images or screenshots and specify targets (e.g., “video-game” or “File, Edit, View”) to see the detections in real time. This versatility makes LocateAnything a promising tool for robotics, where fast, fine-grained object detection is critical, and for automated PC operation, where software must interact with UI elements programmatically.
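To make the automated PC operation use case concrete, here is a minimal Python sketch of what an automation script might do with grounded boxes once it has them: keep the best detection per requested label and turn it into a click coordinate. Everything in it is an assumption for illustration. The Detection fields, the click_targets helper, the score threshold, and the sample boxes are hypothetical and do not reflect LocateAnything's actual output format or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One grounded box. Field names are hypothetical, not the model's schema."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float

    def center(self) -> tuple[int, int]:
        # The middle of the box is a reasonable pixel to click.
        return (round((self.x_min + self.x_max) / 2),
                round((self.y_min + self.y_max) / 2))


def click_targets(detections, wanted, min_score=0.5):
    """Map each wanted label to the center of its highest-scoring detection."""
    best = {}
    for det in detections:
        # Skip labels nobody asked for and low-confidence boxes.
        if det.label not in wanted or det.score < min_score:
            continue
        if det.label not in best or det.score > best[det.label].score:
            best[det.label] = det
    return {label: det.center() for label, det in best.items()}


# Made-up boxes for a menu bar, in pixel coordinates.
detections = [
    Detection("File", 0, 0, 40, 20, 0.97),
    Detection("Edit", 44, 0, 84, 20, 0.95),
    Detection("Edit", 300, 200, 340, 220, 0.40),  # low score, filtered out
    Detection("View", 88, 0, 132, 20, 0.93),
]
targets = click_targets(detections, {"File", "Edit", "View"})
print(targets)
```

A real pipeline would pass these coordinates to an input-automation layer; that step is left out here because it depends on the operating system and tooling.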

Key Points

LocateAnything uses parallel box decoding for extremely fast object detection in photos, app UIs, and text
Outperforms Qwen3-VL and Rex-Omni on granular tasks like recognizing individual windows in buildings and UI elements
Open-source 3B model available on Hugging Face; targeted at robotics and automated PC operation

Why It Matters

Fast, open-source object detection that works across images, UIs, and text could accelerate work in robotics and desktop automation.
