Study finds embedded numbers bias 6 major VLMs 2.5x more than image degradation
Numeric anchors on images skew VLM quality judgments more than severe blur or noise.
A new study by M. Shalankin, published on arXiv, systematically demonstrates that Vision-Language Models (VLMs) exhibit a visual anchoring bias when images contain embedded numbers. Across six VLMs from five architectural families, numeric cues such as scores or ratings strongly sway the models' quality judgments, with ANOVA eta² values (the share of rating variance explained by the anchor) ranging from 0.18 to 0.77 (all p < 0.001). The anchoring effect was 2.5 times larger than the effect of severe image quality degradation (e.g., heavy blur or noise), which suggests the bias is cognitive rather than purely visual.
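For readers unfamiliar with the effect-size measure quoted above, the following is a minimal sketch of how eta² is computed for a one-way design: the between-group sum of squares divided by the total sum of squares. The ratings and the group names (`anchor_2`, `anchor_5`, `anchor_9`) are invented for illustration and are not data from the study, and the function is not the author's analysis code.

```python
from statistics import mean


def eta_squared(groups):
    """Return eta^2 = SS_between / SS_total for a one-way layout."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of every individual score around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical quality ratings (1-10 scale), grouped by the number
# embedded in the image. These values are made up, not taken from the paper.
ratings_by_anchor = {
    "anchor_2": [3.0, 4.0, 3.0, 4.0],
    "anchor_5": [5.0, 5.0, 6.0, 5.0],
    "anchor_9": [7.0, 8.0, 7.0, 8.0],
}

eta2 = eta_squared(list(ratings_by_anchor.values()))
print(f"eta^2 = {eta2:.2f}")
```

A value near 0 means the embedded number explains almost none of the variation in ratings; a value near 1 means it explains almost all of it.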
Using layer-wise probing, the paper reveals a clear dissociation: the layers where anchor classification saturates (typically L12–L34) are suboptimal for actual quality prediction. The optimal layers for quality lie deeper, with R² values between 0.69 and 0.91, suggesting that VLMs separate these tasks across depth. A fusion analysis further shows architecture-dependent integration patterns: two models fuse anchor and quality signals as early as L1–L2, while three others show partial or no fusion. Visual anchoring bias is therefore not a monolithic flaw but varies by design.
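The idea behind layer-wise probing can be sketched with a toy example: fit a simple probe on each layer's representation, score it with R², and compare which layer is best for each target. Everything below is an assumption made for illustration. The "activations" are a single synthetic scalar per layer, the `quality_weight` table is hypothetical, both probes are one-feature regressions scored in-sample, and none of it reproduces the paper's setup, which would use real hidden states, held-out data, and a classifier for the anchor.

```python
def r_squared(x, y):
    """In-sample R^2 of a one-feature least-squares fit y ~ a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    # For a single feature, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical labels: the number embedded in each image and its true
# quality score. The design is balanced so the two are uncorrelated.
anchor = [2.0, 2.0, 9.0, 9.0, 2.0, 2.0, 9.0, 9.0]
quality = [3.0, 7.0, 3.0, 7.0, 4.0, 8.0, 4.0, 8.0]

# Hypothetical weight of the quality signal in each layer's toy feature;
# the remainder is the anchor signal. Layer indices are illustrative.
quality_weight = {4: 0.1, 12: 0.3, 24: 0.6, 32: 0.9}

anchor_r2, quality_r2 = {}, {}
for layer, w in quality_weight.items():
    # Toy stand-in for a hidden state: a mix of the two signals.
    feature = [(1 - w) * a + w * q for a, q in zip(anchor, quality)]
    anchor_r2[layer] = r_squared(feature, anchor)
    quality_r2[layer] = r_squared(feature, quality)

best_anchor_layer = max(anchor_r2, key=anchor_r2.get)
best_quality_layer = max(quality_r2, key=quality_r2.get)
print("best layer for anchor probe: ", best_anchor_layer)
print("best layer for quality probe:", best_quality_layer)
```

In this toy setup the two probes peak at different depths purely by construction. With a real model, the same loop would run over extracted hidden states, and any dissociation would be an empirical result rather than a built-in one.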
These findings provide a causal account of a previously underexplored vulnerability in VLMs. As these models are increasingly used for content moderation, image generation evaluation, and automated quality scoring, the presence of numbers in images could silently skew outputs. The study calls for careful handling of numerical elements in training data and prompts, and suggests that deeper layers may offer more reliable quality assessments.
- Anchor effects are 2.5x larger than severe image quality degradation in all six tested VLMs.
- Optimal quality prediction layers (R² = 0.69–0.91) are consistently deeper than anchor classification saturation layers (L12–L34).
- Integration of anchor and quality signals varies by architecture: early fusion in two models (L1–L2) vs. partial or no fusion in three others.
Why It Matters
VLM quality judgments become unreliable when images contain numbers, a critical flaw for automated quality scoring and content moderation.