Enhancing CLIP Robustness via Cross-Modality Alignment

A new training-free framework mitigates CLIP's vulnerability to adversarial image attacks without sacrificing clean accuracy.

Deep Dive

A research team led by Xingyu Zhu has introduced COLA (Cross-modality Alignment), a novel framework designed to address a critical weakness in popular vision-language models (VLMs) like CLIP. While CLIP excels at zero-shot classification by aligning images and text, its encoded features are highly vulnerable to adversarial perturbations—subtle, malicious image modifications that cause severe performance drops. Existing defenses often involve costly adversarial fine-tuning or prompt tweaks, but COLA tackles the root cause: the misalignment between image and text feature spaces that worsens under attack. This NeurIPS 2025 Spotlight paper presents a training-free solution that directly restores this cross-modal alignment.

The COLA framework operates in two key steps. First, it projects adversarial image embeddings onto a subspace defined by class text features, filtering out non-semantic noise. Second, it models images and texts as distributions and refines their alignment using optimal transport (OT), a mathematical method for comparing distributions. This dual approach ensures both global alignment and local structural consistency in the feature space. Tested across 14 benchmarks, COLA boosted CLIP's adversarial robustness by an average of 6.7% on ImageNet and its variants under strong PGD attacks, without compromising accuracy on standard, clean images. Its plug-and-play, training-free nature means it can be seamlessly integrated with existing fine-tuned CLIP models, offering a practical upgrade for real-world applications where reliability against manipulation is crucial.
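The first step above can be sketched as a least-squares projection of an image embedding onto the subspace spanned by the class text embeddings. This is a minimal illustration of the idea, not the paper's implementation; the function name, shapes, and normalization are assumptions.

```python
import numpy as np

def project_onto_text_subspace(image_emb, text_embs):
    """Project image_emb onto span(text_embs), discarding the orthogonal
    (non-semantic) component, then re-normalize CLIP-style.

    image_emb: (d,) image feature vector
    text_embs: (k, d) one text feature per class
    """
    T = text_embs.T                                   # (d, k) basis as columns
    coeffs, *_ = np.linalg.lstsq(T, image_emb, rcond=None)
    projected = T @ coeffs                            # component inside the text subspace
    return projected / np.linalg.norm(projected)      # unit-norm, as CLIP features are

# Toy usage with random unit-norm features (illustrative dimensions).
rng = np.random.default_rng(0)
d, k = 512, 10
text_embs = rng.normal(size=(k, d))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
img = rng.normal(size=d)
proj = project_onto_text_subspace(img, text_embs)
```

The residual `img - T @ coeffs` is orthogonal to every class text feature, which is what "filtering out non-semantic noise" amounts to in this sketch.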

Key Points
  • COLA is a training-free framework that repairs CLIP's feature misalignment under attack using subspace projection and optimal transport.
  • It improved zero-shot classification accuracy by an average of 6.7% on ImageNet under PGD adversarial attacks.
  • The method is compatible with existing fine-tuned models and maintains high accuracy on clean samples.
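The optimal-transport step mentioned above can be illustrated with entropy-regularized OT (Sinkhorn iterations): treat image and text features as discrete distributions and compute a transport plan between them. This is a generic sketch under stated assumptions (uniform weights, cosine cost), not COLA's exact formulation.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.05, n_iter=200):
    """Entropy-regularized OT plan between histograms a (n,) and b (m,)."""
    K = np.exp(-cost / eps)              # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # plan whose marginals match a and b

# Toy usage: random unit-norm "image" and "text" features (illustrative).
rng = np.random.default_rng(1)
n, m, d = 8, 10, 512
img_feats = rng.normal(size=(n, d))
img_feats /= np.linalg.norm(img_feats, axis=1, keepdims=True)
txt_feats = rng.normal(size=(m, d))
txt_feats /= np.linalg.norm(txt_feats, axis=1, keepdims=True)
cost = 1.0 - img_feats @ txt_feats.T     # cosine distance as transport cost
a = np.full(n, 1.0 / n)                  # uniform weights (assumption)
b = np.full(m, 1.0 / m)
plan = sinkhorn_plan(cost, a, b)
```

The resulting plan encodes a soft matching between the two feature sets; a refinement step could then pull each image feature toward the text features it is transported to, enforcing the local structural consistency the paper targets.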

Why It Matters

Makes AI vision systems like CLIP significantly more reliable and secure against malicious image manipulations in real-world use.