Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
A new method improves few-shot image classification accuracy by 10-15% without adding any inference cost.
A team of researchers has developed a method that significantly improves few-shot learning adapters for vision-language models such as CLIP, without adding any computational cost at deployment. The core innovation is an asymmetric 'training-only' framework: instead of modifying the lightweight adapter (e.g., Tip-Adapter) used for fast inference, the team constructs a powerful auxiliary 'Heterogeneous Graph Teacher' that operates solely during training. This teacher integrates multi-scale visual patches and textual class prompts into a unified graph structure and performs deep cross-modal reasoning with a Modality-aware Graph Transformer (MGT).
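The summary doesn't spell out the MGT's internals, but a minimal PyTorch sketch illustrates the idea: run self-attention over all graph nodes and add a learned bias that depends on whether an edge connects patch-to-patch, patch-to-text, or text-to-text nodes. The class name, the scalar pair bias, and all sizes below are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn


class MGTLayer(nn.Module):
    """One layer of a Modality-aware Graph Transformer (sketch, assumed form).

    Self-attention runs over all graph nodes (multi-scale image patches plus
    text-prompt nodes); a learned scalar bias per ordered modality pair lets
    cross-modal edges be scored differently from intra-modal ones.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pair_bias = nn.Parameter(torch.zeros(2, 2))  # modality 0 = patch, 1 = text
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, nodes: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # nodes: (B, N, D) node features; modality: (N,) long tensor in {0, 1}.
        bias = self.pair_bias[modality.unsqueeze(1), modality.unsqueeze(0)]  # (N, N)
        x = self.norm1(nodes)
        # A float attn_mask is added to the attention scores, acting as an edge bias.
        attn_out, _ = self.attn(x, x, x, attn_mask=bias)
        nodes = nodes + attn_out
        return nodes + self.ffn(self.norm2(nodes))
```

A full teacher would stack several such layers over the heterogeneous graph; because the teacher exists only at training time, none of this cost reaches deployment.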
Crucially, the relational knowledge and high-fidelity class features extracted by this complex teacher are distilled directly into the simple adapter's key-value cache through a cache-aware, dual-objective training strategy. Once training is complete, the bulky graph teacher is discarded entirely. At inference time the system is therefore identical to the original, efficient Tip-Adapter: it uses the same cached prototypes for fast matching, but with significantly upgraded, relationally aware features. The method establishes a new state of the art across standard few-shot benchmarks (1 to 16 shots), with ablations confirming that the graph supervision, text-guided reasoning, and node filtering are each key to its robust performance gains.
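The unchanged inference path is the standard Tip-Adapter scoring rule, sketched below: cached keys are few-shot image features, cached values are their one-hot labels, and the cache logits are blended with CLIP's zero-shot logits. On this reading, the distillation simply swaps better features into `cache_keys`; the `alpha` and `beta` values here are illustrative defaults, not the paper's tuned hyperparameters.

```python
import torch
import torch.nn.functional as F


def tip_adapter_logits(img_feat, cache_keys, cache_values, clip_weights,
                       alpha: float = 1.0, beta: float = 5.5):
    """Standard Tip-Adapter scoring; the distilled method leaves this
    function untouched and only changes the contents of cache_keys.

    img_feat:     (B, D) test image features
    cache_keys:   (N*K, D) cached few-shot features (the distilled prototypes)
    cache_values: (N*K, C) one-hot labels of the cached shots
    clip_weights: (D, C) text classifier built from class prompts
    """
    img_feat = F.normalize(img_feat, dim=-1)
    affinity = img_feat @ cache_keys.t()                    # (B, N*K) cosine similarities
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    clip_logits = 100.0 * img_feat @ clip_weights           # standard CLIP logit scale
    return clip_logits + alpha * cache_logits
```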
- Introduces a 'training-only' Heterogeneous Graph Teacher that boosts adapter performance by analyzing fine-grained image-patch-text relations.
- Uses a Modality-aware Graph Transformer (MGT) for cross-modal reasoning and applies discriminative node filtering to extract better class features.
- Distills relational knowledge into the adapter's cache via a dual-objective loss (see the sketch after this list), achieving state-of-the-art results on 1-16 shot benchmarks with zero extra inference latency or memory.
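Neither the node-filtering rule nor the dual objective is fully specified in this summary, so the following is a hedged sketch of one plausible instantiation: score patch nodes by alignment with the class prompt, keep the top fraction, pool them into a teacher class feature, and train the adapter with a feature-alignment term plus a logit-distillation term on top of cross-entropy. All function names, the keep ratio, and the loss weights below are assumptions.

```python
import torch
import torch.nn.functional as F


def filter_and_pool(patch_feats, text_feat, keep_ratio: float = 0.5):
    """Hypothetical discriminative node filtering: keep the patch nodes most
    aligned with the class's text prompt, then mean-pool them into a single
    teacher class feature. Scoring rule and keep_ratio are assumptions."""
    scores = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)  # (N,)
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    kept = patch_feats[scores.topk(k).indices]
    return F.normalize(kept.mean(dim=0), dim=-1)


def dual_objective_loss(cache_keys, teacher_feats, student_logits,
                        teacher_logits, labels, lam: float = 1.0, tau: float = 2.0):
    """Hypothetical cache-aware dual objective: (i) align each cached key with
    the teacher's refined feature for that shot, and (ii) distill the teacher's
    logits into the adapter, on top of standard cross-entropy."""
    feat_loss = (1.0 - F.cosine_similarity(cache_keys, teacher_feats, dim=-1)).mean()
    kd_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(teacher_logits / tau, dim=-1),
                       reduction="batchmean") * tau * tau
    return F.cross_entropy(student_logits, labels) + lam * (feat_loss + kd_loss)
```

After training, only the (now teacher-aligned) cached keys survive; the teacher, the filtering step, and the loss machinery are all thrown away.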
Why It Matters
Enables more accurate and robust AI vision systems that can learn from very few examples, without sacrificing the speed needed for real-world applications.