Research & Papers

MEDiC: Multi-objective Exploration of Distillation from CLIP

New vision training method achieves 73.9% kNN accuracy on ImageNet by combining three complementary objectives.

Deep Dive

Researchers from the University of Tennessee have introduced MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that addresses limitations of current masked image modeling (MIM) approaches. Traditional MIM methods typically operate either in raw pixel space (reconstructing masked patches) or in latent feature space (aligning with a pre-trained teacher such as CLIP). MEDiC combines both spaces through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. Together, these objectives provide richer training signals than any single objective alone.
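The three objectives described above can be sketched as a single weighted loss. This is an illustrative reconstruction, not the paper's actual implementation: the function name, tensor shapes, and default weights are assumptions, and the squared-error/cosine formulations are common choices for these loss types rather than details confirmed by the source.

```python
import numpy as np

def medic_loss(student_patches, teacher_patches,   # (N, D) patch tokens
               student_cls, teacher_cls,           # (D,) global tokens
               pred_pixels, target_pixels,         # (N, P) per-patch pixel values
               mask,                               # (N,) bool, True = masked patch
               w_distill=1.0, w_align=1.0, w_recon=1.0):
    """Weighted sum of MEDiC's three objectives (weights are illustrative)."""
    # 1) Patch-level token distillation: match the frozen CLIP teacher's
    #    patch features on the masked positions.
    l_distill = np.mean((student_patches[mask] - teacher_patches[mask]) ** 2)

    # 2) Global CLS alignment: negative cosine similarity between the
    #    student's and teacher's global representations.
    cos = np.dot(student_cls, teacher_cls) / (
        np.linalg.norm(student_cls) * np.linalg.norm(teacher_cls) + 1e-8)
    l_align = 1.0 - cos

    # 3) Pixel reconstruction of masked patches; pred_pixels stands in for
    #    the output of the lightweight decoder.
    l_recon = np.mean((pred_pixels[mask] - target_pixels[mask]) ** 2)

    return w_distill * l_distill + w_align * l_align + w_recon * l_recon
```

When student and teacher agree and the reconstruction is perfect, all three terms vanish, so the total loss is zero; each mismatch contributes through its own scalar weight.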

The researchers systematically investigated the design space and report several key insights. First, all three objectives provide complementary information, with the full combination achieving 73.9% kNN accuracy on ImageNet-1K. Second, they introduced hierarchical clustering with a relative position bias for evolved masking, but found that despite producing more semantically coherent masks, evolved masking did not outperform simple block masking in teacher-guided distillation settings. Third, and perhaps most surprisingly, the optimal scalar loss weights are extremely fragile: small perturbations can drop accuracy by up to 17 percentage points. With ViT-Base at 300 epochs, the framework reaches 73.9% kNN and 85.1% fine-tuning accuracy, demonstrating the effectiveness of the multi-objective approach.
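For context on the masking comparison, the "simple block masking" baseline can be sketched as below: random rectangular blocks of patches are masked until a target ratio is reached. This is a minimal sketch of the generic blockwise-masking idea (as used in BEiT-style pretraining), not the paper's exact procedure; the grid size, mask ratio, and block-size cap are illustrative.

```python
import numpy as np

def block_mask(grid=14, mask_ratio=0.4, max_block=4, rng=None):
    """Mask random rectangular blocks on a grid x grid patch grid until
    at least mask_ratio of the patches are masked. All parameters are
    illustrative defaults, not values from the paper."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(mask_ratio * grid * grid)
    while mask.sum() < target:
        # Sample a block size and a top-left corner that keeps it in bounds.
        h = rng.integers(1, max_block + 1)
        w = rng.integers(1, max_block + 1)
        top = rng.integers(0, grid - h + 1)
        left = rng.integers(0, grid - w + 1)
        mask[top:top + h, left:left + w] = True
    return mask
```

The paper's finding is that, under teacher-guided distillation, a baseline of this kind already matches their more semantically coherent evolved masks.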

Key Points
  • Combines three objectives: CLIP token distillation, global alignment, and pixel reconstruction for comprehensive training
  • Achieves 73.9% kNN accuracy on ImageNet-1K with ViT-Base at 300 epochs
  • Reveals loss weight fragility with small perturbations causing up to 17 percentage point accuracy drops

Why It Matters

Provides a more robust framework for training vision models that could improve downstream computer vision applications.