LAGO framework boosts zero-shot image recognition with fewer regions
Running class-agnostic detection first, then adaptive language-guided refinement, avoids prediction loops.
Zero-shot image recognition typically matches whole images to class descriptions, but fine-grained tasks require focusing on local parts. Existing localized methods crop many random regions, wasting computation and introducing noise. Worse, adding semantic guidance too early creates a "prediction loop": inaccurate intermediate results bias subsequent localization, compounding errors.
LAGO addresses this in two stages. First, a class-agnostic object detector proposes candidate regions without semantic bias. Second, adaptive language-guided refinement adjusts these regions, using intermediate confidence scores to control how much semantic influence to apply. Object-level, contextual, and full-image evidence are then fused via an object-context dual-channel aggregation. Experimental results show LAGO achieves state-of-the-art results on benchmarks such as CUB, SUN, and AWA2, as well as under distribution shifts, while requiring substantially fewer candidate regions at inference.
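The confidence-controlled guidance and the final fusion step can be illustrated with a minimal Python sketch. It is not LAGO's actual implementation: the summary gives no formulas, so the normalized top-1 confidence gate, the linear blend of scores, and the equal-weight sum of logits are all assumptions, and every function name here (guidance_weight, refine_region_scores, fuse_evidence) is hypothetical. The fusion in particular is a simplified stand-in for the object-context dual-channel aggregation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_weight(class_probs):
    """Hypothetical confidence gate in [0, 1] (assumes at least two classes).

    Returns 0 when the intermediate prediction is uniform (no confidence)
    and 1 when it is one-hot (full confidence).
    """
    chance = 1.0 / len(class_probs)
    return (max(class_probs) - chance) / (1.0 - chance)


def refine_region_scores(objectness, semantic_sim, class_probs):
    """Blend class-agnostic and semantic region scores (assumed linear blend)."""
    w = guidance_weight(class_probs)
    return [(1.0 - w) * o + w * s for o, s in zip(objectness, semantic_sim)]


def fuse_evidence(object_logits, context_logits, image_logits,
                  weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of per-class logits from three sources, then softmax."""
    w_obj, w_ctx, w_img = weights
    fused = [w_obj * o + w_ctx * c + w_img * i
             for o, c, i in zip(object_logits, context_logits, image_logits)]
    return softmax(fused)


# Two candidate regions, two classes; all numbers are made up.
uncertain = refine_region_scores([0.9, 0.2], [0.1, 0.8], [0.5, 0.5])
confident = refine_region_scores([0.9, 0.2], [0.1, 0.8], [1.0, 0.0])
probs = fuse_evidence([2.0, 0.5], [1.0, 1.0], [0.5, 1.0])
```

With a uniform intermediate prediction the gate is zero, so region scores stay purely class-agnostic. This is the behavior that keeps an unreliable early guess from steering localization. As the prediction sharpens, the semantic term takes over.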
- LAGO uses class-agnostic object detection first to avoid the 'prediction loop' failure mode.
- Adaptive confidence-based semantic guidance reduces reliance on many random crops.
- Achieves state-of-the-art results on multiple fine-grained zero-shot benchmarks while using fewer candidate regions.
Why It Matters
Efficient, robust zero-shot recognition enables real-time fine-grained classification without retraining, crucial for evolving visual AI applications.