Research & Papers

LAGO framework boosts zero-shot image recognition with fewer regions

Class-agnostic detection comes first; adaptive language-guided refinement follows, which avoids prediction loops.

Deep Dive

Zero-shot image recognition typically matches whole images to class descriptions, but fine-grained tasks require focusing on local parts. Existing localized methods crop many random regions, wasting computation and introducing noise. Worse, adding semantic guidance too early creates a "prediction loop": inaccurate intermediate results bias subsequent localization, compounding errors.
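To make the baseline concrete, here is a minimal sketch of zero-shot scoring by cosine similarity, followed by naive averaging over crops. The toy two-dimensional embeddings and the helper names (`cosine`, `zero_shot_scores`, `average_over_crops`) are illustrative assumptions, not anything taken from the paper; a real system would obtain the vectors from a vision-language encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_scores(image_vec, class_vecs):
    """Score one image (or crop) embedding against each class-description embedding."""
    return {name: cosine(image_vec, vec) for name, vec in class_vecs.items()}

def average_over_crops(crop_vecs, class_vecs):
    """Naive localized scoring: average the per-class similarity over all crops."""
    totals = {name: 0.0 for name in class_vecs}
    for crop in crop_vecs:
        for name, score in zero_shot_scores(crop, class_vecs).items():
            totals[name] += score
    return {name: total / len(crop_vecs) for name, total in totals.items()}

# Toy data (assumed): one crop that covers the object, two that mostly see background.
class_vecs = {"sparrow": [1.0, 0.0], "finch": [0.0, 1.0]}
crops = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

scores = average_over_crops(crops, class_vecs)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the two background crops outvote the single informative crop, which is the kind of noise that indiscriminate cropping can introduce.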

LAGO addresses this in two stages. First, a class-agnostic object detector identifies candidate regions without semantic bias. Then, adaptive language-guided refinement adjusts these regions, using intermediate confidence scores to control how much semantic influence to apply. The final prediction fuses object-level, contextual, and full-image evidence through an object-context dual-channel aggregation. According to the reported experiments, LAGO achieves state-of-the-art results on benchmarks such as CUB, SUN, and AWA2, as well as under distribution shift, while requiring substantially fewer candidate regions at inference.
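The confidence-gated refinement and evidence fusion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not LAGO's actual formulation: the gate (top-1 softmax probability), the linear box interpolation, the fixed fusion weights, and every function name here are hypothetical choices made for this example.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_weight(class_scores):
    """Hypothetical gate: top-1 probability of the intermediate prediction."""
    return max(softmax(class_scores))

def refine_box(detector_box, text_guided_box, weight):
    """Move a detector box toward a text-guided box in proportion to weight in [0, 1]."""
    w = min(1.0, max(0.0, weight))
    return tuple((1.0 - w) * d + w * t for d, t in zip(detector_box, text_guided_box))

def fuse(object_scores, context_scores, full_scores, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of per-class scores from object, context, and full-image evidence."""
    w_obj, w_ctx, w_full = weights  # assumed values, not from the paper
    return {
        name: w_obj * object_scores[name]
        + w_ctx * context_scores[name]
        + w_full * full_scores[name]
        for name in object_scores
    }

# Two classes tied in the intermediate prediction, so the gate stays moderate.
weight = semantic_weight([1.0, 1.0])
box = refine_box((0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 8.0, 8.0), weight)

fused = fuse(
    {"a": 0.8, "b": 0.2},  # object-level evidence
    {"a": 0.4, "b": 0.6},  # contextual evidence
    {"a": 0.6, "b": 0.4},  # full-image evidence
)
print(weight, box, fused)
```

The point of the gate is that an uncertain intermediate prediction moves the box only part of the way toward the language-guided proposal, so a wrong early guess cannot fully steer localization.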

Key Points
  • LAGO uses class-agnostic object detection first to avoid the "prediction loop" failure mode.
  • Adaptive confidence-based semantic guidance reduces reliance on many random crops.
  • Achieves state-of-the-art results on multiple fine-grained zero-shot benchmarks with fewer candidate regions.

Why It Matters

Efficient, robust zero-shot recognition makes fine-grained classification feasible without retraining and at lower inference cost, which matters for visual AI applications whose categories keep changing.