Viral Wire

NVIDIA's LocateAnything detects objects in photos, UIs, and text at high speed

New open-source model beats Qwen3-VL and Rex-Omni on UI and fine-grained detection

Deep Dive

NVIDIA has unveiled LocateAnything, a high-speed vision-language grounding model designed for object detection across diverse media, including photos, application screenshots, and text documents. The model uses a novel parallel box decoding mechanism for fast bounding-box prediction; a demonstration video shows near-instant detection. LocateAnything is open-source, with a 3B-parameter version available on Hugging Face, and it will be presented at CVPR 2026.

In benchmark comparisons, LocateAnything significantly outperforms Qwen3-VL and Rex-Omni on challenging fine-grained tasks, such as detecting each window in a building individually, separating individual pieces of wood, and recognizing characters with high accuracy. Its training data covers UI layouts and document structures in addition to natural images, which helps it detect application UI elements such as menu items. A live demo on Hugging Face Spaces lets users upload an image or screenshot and name the targets (e.g., “video-game” or “File, Edit, View”) to see the detections in real time. This versatility makes LocateAnything a promising tool for robotics, where fast, fine-grained object detection is critical, and for automated PC operation, such as interacting with UI elements programmatically.
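To make the automated PC operation use case concrete, here is a minimal Python sketch of how a script might turn a detected UI element into a click position. The article does not describe LocateAnything's output format, so the `Detection` type, its `(x1, y1, x2, y2)` pixel box, the `score` field, and both helper functions are hypothetical and only illustrate the general idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; not LocateAnything's actual output format."""
    label: str
    box: tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    score: float  # assumed confidence in [0, 1]


def pick_target(detections: list[Detection], label: str,
                min_score: float = 0.5) -> Optional[Detection]:
    """Return the highest-scoring detection with the given label, or None."""
    candidates = [d for d in detections
                  if d.label == label and d.score >= min_score]
    return max(candidates, key=lambda d: d.score, default=None)


def click_point(det: Detection) -> tuple[int, int]:
    """Return the pixel at the center of the detection's bounding box."""
    x1, y1, x2, y2 = det.box
    return (round((x1 + x2) / 2), round((y1 + y2) / 2))


# Made-up detections for a menu bar, as a grounding model might return them.
detections = [
    Detection("File", (10, 0, 50, 20), 0.9),
    Detection("Edit", (60, 0, 100, 20), 0.8),
    Detection("File", (0, 0, 5, 5), 0.3),  # low-confidence duplicate, filtered out
]

target = pick_target(detections, "File")
if target is not None:
    print(click_point(target))
```

The resulting coordinates could then be handed to whatever input-automation layer the script uses. Filtering by score before clicking is a design choice in this sketch, meant to avoid acting on weak detections.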

Key Points
  • LocateAnything uses parallel box decoding for extremely fast object detection in photos, app UIs, and text
  • Outperforms Qwen3-VL and Rex-Omni on granular tasks like recognizing individual windows in buildings and UI elements
  • Open-source 3B model available on Hugging Face; targeted at robotics and automated PC operation

Why It Matters

Fast, open-source object detection that works across images, UIs, and text could accelerate work in robotics and desktop automation.