Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages
Major VLMs like GPT-4o show accuracy drops of 9.8 to 25 percentage points when reasoning in Hindi, Tamil, and Bengali versus English.
A new research paper titled "Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages" by Swastik R presents the first comprehensive evaluation of Vision-Language Models (VLMs) on non-English visual reasoning. The study translated 980 questions from established benchmarks like MathVista, ScienceQA, and MMMU into six major Indian languages—Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi—using the IndicTrans2 model, with cross-verification via Gemini 2.0 Flash.
The audit evaluated eight VLMs, ranging from 7B-parameter open-source models to GPT-4o, generating 68,600 inference records. The results are stark: when switching from English to an Indian language, model accuracy dropped between 9.8 and 25 percentage points. A key finding is that Dravidian languages (like Tamil and Telugu) suffered up to 13.2 percentage points more degradation than Indo-Aryan languages (like Hindi and Bengali).
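The per-language degradation figures above can be computed directly from inference records. Below is a minimal illustrative sketch (not the paper's released code) of how one might aggregate such records into per-language accuracy and percentage-point drops versus English; the record fields (`lang`, `correct`) and the `"en"` language code are assumptions.

```python
# Illustrative sketch: per-language accuracy and the percentage-point
# drop relative to English, from a flat list of inference records.
# Record format {"lang": ..., "correct": 0/1} is assumed, not from the paper.
from collections import defaultdict

def accuracy_by_language(records):
    """Return {language: accuracy in percent} from inference records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["lang"]] += 1
        hits[r["lang"]] += int(r["correct"])
    return {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}

def drop_vs_english(records):
    """Percentage-point accuracy drop for each non-English language."""
    acc = accuracy_by_language(records)
    return {lang: acc["en"] - a for lang, a in acc.items() if lang != "en"}

# Toy example: English gets 3/4 correct, Hindi 2/4 -> a 25.0 pp drop.
records = (
    [{"lang": "en", "correct": c} for c in (1, 1, 1, 0)]
    + [{"lang": "hi", "correct": c} for c in (1, 1, 0, 0)]
)
print(drop_vs_english(records))  # {'hi': 25.0}
```

The same aggregation scales to the full 68,600-record audit by grouping additionally on model name.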
Surprisingly, chain-of-thought prompting, a technique designed to improve reasoning by having the model 'think step-by-step,' actually harmed performance in languages like Bengali (-14.4 pp) and Kannada (-11.4 pp). This suggests the reasoning chains themselves are English-centric and do not transfer. Even Aya-Vision-8B, a model explicitly built for 23 languages, dropped 28.5 percentage points on Dravidian scripts, showing that multilingual pretraining alone does not guarantee visual reasoning capability.
The researcher has released the translated benchmark and all model outputs, providing a crucial resource for developers aiming to build truly equitable, multilingual AI systems. This work exposes a critical blind spot in AI evaluation and highlights the need for reasoning capabilities that are language-agnostic, not just trained on more text.
- Accuracy drops of 9.8-25 percentage points for VLMs like GPT-4o when reasoning in Hindi, Tamil, or Bengali versus English.
- Chain-of-thought prompting degrades performance in Bengali (-14.4 pp) and Kannada (-11.4 pp), revealing English-centric reasoning logic.
- Even specialized multilingual models like Aya-Vision-8B fail, dropping 28.5 pp on Dravidian scripts, showing pretraining isn't enough for visual reasoning.
Why It Matters
Exposes a major equity gap in AI: models fail billions of non-English speakers on real-world visual reasoning tasks, demanding new training approaches.