Research & Papers

MULTITEXTEDIT benchmark reveals AI image editors fail on non-English text

3,600 test images across 12 languages show all 12 models degrade on script fidelity

Deep Dive

A team of researchers (Cheng et al.) has released MULTITEXTEDIT, a controlled benchmark designed to expose how poorly text-in-image editing systems handle non-English languages. The dataset contains 3,600 curated instances across 12 typologically diverse languages (including Hebrew, Arabic, Dutch, and Spanish), 5 visual domains, and 7 editing operations. All instances share a common visual base and come with human-edited references and region masks, so that differences in output quality can be attributed to language rather than to image content.

Standard text-matching metrics miss script-level errors such as missing diacritics, reversed right-to-left order, and mixed-script renderings. To capture them, the authors propose a new Language Fidelity (LSF) metric. It uses a two-stage LVM (Language Vision Model) protocol that first traces the edited target text and then judges it in isolation, achieving a quadratic-weighted kappa of 0.76 against native-speaker annotators.

When the authors evaluated 12 open-source and proprietary systems using LSF alongside standard semantic and pixel-based metrics, every model showed pronounced cross-lingual degradation. The largest drops occurred for Hebrew and Arabic (RTL scripts with complex diacritics), while Dutch and Spanish saw the smallest gaps. Degradation was concentrated in text accuracy and script fidelity rather than coarse structural dimensions, revealing a pervasive mismatch between semantic and pixel-level quality: outputs preserve global layout and background fidelity yet distort script-specific forms. The findings suggest that current text-in-image editing is fundamentally English-centric and will require new approaches to script diversity.
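For readers unfamiliar with the agreement statistic cited above, here is a minimal sketch of how a quadratic-weighted kappa can be computed between two sets of ordinal ratings. The function name, the 0-to-4 rating scale, and the sample ratings are all illustrative assumptions; the summary does not state which scale or tooling the authors used.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for ordinal ratings in 0..num_levels-1.

    Disagreements are penalised by the squared distance between the two
    ratings, so a 0-vs-4 clash costs far more than a 3-vs-4 clash.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two rating levels")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected joint frequency if the raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        # Both raters used one identical level throughout: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on an assumed 0-4 scale (not data from the paper).
human_scores = [4, 3, 0, 2, 4, 1, 3, 0]
judge_scores = [4, 2, 0, 2, 3, 1, 3, 1]
print(f"QWK = {quadratic_weighted_kappa(human_scores, judge_scores, 5):.2f}")
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.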

Key Points
  • MULTITEXTEDIT includes 3,600 instances across 12 languages, 5 visual domains, and 7 editing operations
  • New LSF metric uses a two-stage LVM protocol and reaches a quadratic-weighted kappa of 0.76 against native-speaker annotators
  • All 12 tested models degraded across languages, with the largest drops on Hebrew/Arabic and the smallest on Dutch/Spanish; typical errors include missing diacritics, reversed RTL order, and mixed-script renderings
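
To make the three error types concrete, the sketch below flags them on plain strings using only Python's standard `unicodedata` module. It is not the LSF metric, which judges rendered images with a vision model. It assumes the rendered text has already been transcribed into a string, and the function names are hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks (Unicode categories Mn/Mc/Me), e.g. niqqud or tildes."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


def scripts_used(text):
    """Rough script inventory: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}


def diagnose(target, rendered):
    """Return a list of script-level problems in `rendered` relative to `target`."""
    target = unicodedata.normalize("NFC", target)
    rendered = unicodedata.normalize("NFC", rendered)
    if target == rendered:
        return []

    issues = []
    bare_target, bare_rendered = strip_marks(target), strip_marks(rendered)
    if bare_target == bare_rendered:
        # Same base letters, so the difference lies in the combining marks.
        issues.append("diacritics differ")
    elif bare_rendered == bare_target[::-1]:
        # Base letters appear in exactly the opposite order.
        issues.append("reversed order")
    if scripts_used(rendered) - scripts_used(target):
        # Letters from a script that the target never uses.
        issues.append("unexpected script")
    return issues


# Hebrew "shalom" with vowel points vs. the same word rendered without them.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
unpointed = "\u05e9\u05dc\u05d5\u05dd"
print(diagnose(pointed, unpointed))

# Spanish word whose first letter was swapped for a look-alike Cyrillic letter.
print(diagnose("a\u00f1o", "\u0430\u00f1o"))
```

String checks like these only cover the simplest cases, and they depend on an accurate transcription of the image.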

Why It Matters

AI image editors remain English-centric when they edit text; this benchmark gives developers a controlled way to measure where script handling breaks down and to track progress on fixing it.