BEiTScore: Lightweight AI metric beats LLM judges for image captioning

No more expensive LLM judges – a lightweight cross-encoder matches SOTA with 90% less compute.

Deep Dive

Evaluating image captions is becoming harder as models generate long, context-rich descriptions. Current state-of-the-art metrics rely either on large language models (LLMs) as judges, which are computationally expensive, or on CLIP-based encoders, which are constrained by token limits and lack compositional understanding. Researchers from IST (Gonçalo Gomes, Bruno Martins, Chrysoula Zerva) propose BEiTScore, a new learned metric that tackles these challenges with a lightweight cross-encoder architecture.

BEiTScore is initialized from a visual question-answering model checkpoint, balancing strong weight initialization with computational efficiency. The training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance sensitivity to fine-grained visual-linguistic errors. The team also introduces a new benchmark for detailed captioning evaluation across diverse scenarios. Experimental results show BEiTScore achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
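
To make the quality-aware decoding use case concrete, here is a minimal sketch of best-of-N reranking with a reference-free scorer. It is an illustration under stated assumptions, not the authors' implementation: `rerank_captions` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap placeholder standing in for a learned metric such as BEiTScore, whose actual interface is not described here.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed scorer signature: (image, caption) -> float, higher is better.
# This signature is hypothetical, not BEiTScore's published API.
Scorer = Callable[[object, str], float]


def rerank_captions(
    image: object, candidates: Sequence[str], score_fn: Scorer
) -> List[Tuple[str, float]]:
    """Score every candidate against the image and sort best-first."""
    scored = [(caption, score_fn(image, caption)) for caption in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(image: object, caption: str) -> float:
    """Placeholder scorer: fraction of caption words found among image tags.

    A real learned metric would encode the image and caption jointly;
    here the "image" is simply a set of tag strings.
    """
    tags = set(image)
    words = caption.lower().split()
    return sum(w in tags for w in words) / len(words) if words else 0.0


# Toy "image" represented by its tags, plus three sampled candidate captions.
image_tags = {"dog", "red", "ball", "grass"}
candidates = [
    "a cat on a sofa",
    "a dog chasing a red ball on grass",
    "a dog on a sofa",
]

ranking = rerank_captions(image_tags, candidates, toy_score)
best_caption = ranking[0][0]
print(best_caption)
```

Because the scorer needs no reference captions, the same loop works on unlabeled images; the cost is one scorer call per candidate, which is where a lightweight metric pays off compared with an LLM judge.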

Key Points
  • BEiTScore is a reference-free metric – no human-written captions needed for evaluation.
  • It is initialized from a VQA checkpoint and uses a lightweight cross-encoder instead of costly LLM judges.
  • Adversarial LLM-based data augmentations improve detection of fine-grained visual-linguistic errors.
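
The idea behind adversarial augmentation can be sketched as follows: take a correct caption, introduce one small error, and check whether a metric ranks the original above the corrupted version. Everything below is an assumption made for illustration. The rule-based `SWAPS` table stands in for LLM-generated perturbations, `toy_score` is a word-overlap placeholder for a learned metric, and the pairwise-accuracy check is a generic sensitivity test rather than the paper's training objective or benchmark.

```python
from typing import Callable, Dict, List, Set, Tuple

# Illustrative swap table; a rule-based stand-in for LLM-written perturbations.
SWAPS: Dict[str, str] = {"red": "blue", "left": "right", "two": "three"}


def perturb(caption: str) -> str:
    """Replace the first swappable word, yielding a subtly wrong caption."""
    words = caption.split()
    for i, word in enumerate(words):
        if word in SWAPS:
            words[i] = SWAPS[word]
            break
    return " ".join(words)


def make_pairs(
    examples: List[Tuple[Set[str], str]]
) -> List[Tuple[Set[str], str, str]]:
    """Build (image, original, perturbed) triples, skipping unchanged captions."""
    pairs = []
    for image, caption in examples:
        corrupted = perturb(caption)
        if corrupted != caption:
            pairs.append((image, caption, corrupted))
    return pairs


def toy_score(image: Set[str], caption: str) -> float:
    """Placeholder metric: fraction of caption words present in the image tags."""
    words = caption.lower().split()
    return sum(w in image for w in words) / len(words) if words else 0.0


def pairwise_accuracy(
    pairs: List[Tuple[Set[str], str, str]],
    score_fn: Callable[[Set[str], str], float],
) -> float:
    """Fraction of pairs where the original outscores the perturbed caption."""
    if not pairs:
        return 0.0
    wins = sum(score_fn(img, good) > score_fn(img, bad) for img, good, bad in pairs)
    return wins / len(pairs)


# Toy data: each "image" is a set of tags paired with a correct caption.
examples = [
    ({"two", "dogs", "red", "ball"}, "two dogs with a red ball"),
    ({"man", "left", "car"}, "a man left of a car"),
    ({"tree"}, "a tree"),  # nothing to swap, so no pair is produced
]

pairs = make_pairs(examples)
accuracy = pairwise_accuracy(pairs, toy_score)
print(len(pairs), accuracy)
```

The same triples could also serve as hard negatives during training, since each pair differs in exactly one detail and therefore forces a metric to attend to fine-grained content rather than overall fluency.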

Why It Matters

Faster, cheaper caption evaluation enables practical large-scale benchmarking and reward-guided AI training without sacrificing accuracy.