DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
Zero-shot segmentation gets a 2x accuracy boost with no new parameters.
A new paper from researchers Mohamad Zamini and Diksha Shukla introduces DouC, a training-free framework that significantly improves open-vocabulary semantic segmentation using a dual-branch CLIP architecture. The system addresses two key limitations of existing CLIP-based methods: unreliable local tokens and insufficient spatial coherence. DouC decomposes dense prediction into two complementary branches. OG-CLIP enhances patch-level reliability through lightweight, inference-time token gating, while FADE-CLIP injects external structural priors via proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, so local token reliability and structure-aware patch interactions jointly shape the final prediction. An optional instance-aware correction step is applied as post-processing.
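To make the decomposition concrete, here is a minimal PyTorch sketch of how such a dual-branch, logit-level fusion could be wired up. The specific gating rule, the proxy-attention form, the function names (`og_clip_gate`, `fade_clip_proxy_attention`, `douc_logits`), and the fusion weight `alpha` are illustrative assumptions, not the paper's exact formulation; the sketch also assumes the frozen foundation model's features have been resampled to CLIP's patch grid.

```python
import torch
import torch.nn.functional as F

def og_clip_gate(patch_tokens, cls_token, keep_ratio=0.7):
    # Hypothetical gating rule: treat patch tokens whose similarity to the
    # global [CLS] token falls in the bottom (1 - keep_ratio) as unreliable
    # and blend them toward the mean patch feature.
    sim = F.cosine_similarity(patch_tokens, cls_token.unsqueeze(1), dim=-1)  # (B, N)
    thresh = torch.quantile(sim, 1.0 - keep_ratio, dim=1, keepdim=True)      # (B, 1)
    gate = (sim >= thresh).float().unsqueeze(-1)                             # (B, N, 1)
    mean_feat = patch_tokens.mean(dim=1, keepdim=True)
    return gate * patch_tokens + (1.0 - gate) * mean_feat

def fade_clip_proxy_attention(patch_tokens, vfm_features):
    # Proxy attention: patch interactions follow the affinity of a frozen
    # vision foundation model instead of CLIP's own attention, injecting
    # structural priors into the value mixing. vfm_features: (B, N, Dv).
    scale = vfm_features.shape[-1] ** 0.5
    affinity = F.softmax(vfm_features @ vfm_features.transpose(-2, -1) / scale, dim=-1)
    return affinity @ patch_tokens                                           # (B, N, D)

def douc_logits(patch_tokens, cls_token, vfm_features, text_embeds, alpha=0.5):
    # Run both branches on the same CLIP patch tokens, classify each
    # branch's features against the text embeddings, then fuse at the
    # logit level (alpha is an illustrative fusion weight).
    og = F.normalize(og_clip_gate(patch_tokens, cls_token), dim=-1)
    fade = F.normalize(fade_clip_proxy_attention(patch_tokens, vfm_features), dim=-1)
    txt = F.normalize(text_embeds, dim=-1)                                   # (C, D)
    return alpha * (og @ txt.T) + (1.0 - alpha) * (fade @ txt.T)             # (B, N, C)
```

Because the fusion happens on logits rather than features, each branch can be adjusted or swapped independently without touching CLIP's weights, which is what keeps the approach training-free.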
DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization capabilities. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity. In practice, developers and researchers can achieve state-of-the-art open-vocabulary segmentation without the computational cost of fine-tuning, making it attractive for real-world applications like autonomous driving, medical imaging, and robotics, where segmenting novel categories on the fly is critical.
- DouC uses two branches: OG-CLIP (token gating for patch reliability) and FADE-CLIP (structural priors from frozen vision foundation models).
- Zero additional training or parameters required; preserves CLIP's zero-shot generalization.
- Outperforms prior training-free methods across eight benchmarks and scales with model size.
Why It Matters
DouC enables accurate pixel-level labeling without retraining, cutting costs and deployment time for vision AI.