SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Researchers discover AI models ignore text in images, then create a fix that works with 30x less data.
A research team led by Yibo Peng and Peng Xia has published a paper revealing a critical flaw in today's Multimodal Large Language Models (MLLMs) such as Qwen2.5-VL. Their diagnostic experiments expose a phenomenon they term 'modality laziness': despite possessing strong Optical Character Recognition (OCR) capabilities, these models often ignore text embedded in images and instead rely on shortcuts from the accompanying text prompts. When the researchers rendered text queries directly onto images (creating Visualized Questions, or VQs), model performance degraded by up to 12.7%, indicating that the models were not genuinely 'reading' visual text. This finding challenges assumptions about how MLLMs process multimodal information.
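To make the diagnostic concrete, here is a minimal sketch of how a Visualized Question might be constructed with Pillow. The function name, banner layout, and default styling are illustrative assumptions, not the authors' implementation.

```python
from PIL import Image, ImageDraw, ImageFont

def render_visualized_question(image, question, color="black",
                               position="bottom", font=None):
    """Draw the question onto the image so a model must read it visually.

    Hypothetical helper: the paper's exact layout is not specified here, so
    this sketch adds a white banner above or below the image and writes the
    question inside it.
    """
    img = image.convert("RGB")
    font = font or ImageFont.load_default()
    banner_h = 60  # rough space for a short question

    canvas = Image.new("RGB", (img.width, img.height + banner_h), "white")
    if position == "top":
        canvas.paste(img, (0, banner_h))
        text_y = 10
    else:  # default: banner below the original image
        canvas.paste(img, (0, 0))
        text_y = img.height + 10

    ImageDraw.Draw(canvas).text((10, text_y), question, fill=color, font=font)
    return canvas
```

For the diagnostic, the text prompt shown alongside the VQ would then omit the question itself (e.g. "Answer the question shown in the image."), so any accuracy drop reflects the model failing to read the rendered text.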
To solve this, the team developed SimpleOCR, a plug-and-play training strategy that imposes a structural constraint during learning. The method transforms standard training samples into the VQ format with randomized text styles (fonts, colors, positions), invalidating text-based shortcuts and forcing models to activate their visual text-extraction pathways. SimpleOCR requires no architectural changes to existing models. It outperforms the base Qwen2.5-VL model by 5.4% on four out-of-distribution benchmarks and beats GRPO training on original images by 2.7%. It is also highly data efficient, achieving superior performance with just 8.5K samples, 30 times fewer than recent reinforcement learning (RL)-based methods. The approach is compatible with advanced RL strategies such as NoisyRollout for further gains, offering a practical upgrade path for existing MLLM deployments.
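Building on the renderer above, the SimpleOCR training transform can be sketched as a per-sample conversion. The style pools, font file names, prompt wording, and field names below are illustrative assumptions rather than the paper's exact recipe.

```python
import random
from PIL import ImageFont

# Illustrative style pools: the paper randomizes fonts, colors, and positions,
# but these particular values (and the font file names) are assumptions.
FONT_PATHS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]
TEXT_COLORS = ["black", "darkblue", "darkred"]
POSITIONS = ["top", "bottom"]

def to_simpleocr_sample(sample):
    """Convert a standard {image, question, answer} sample into VQ format.

    The question is rendered onto the image with a randomly drawn style and
    dropped from the text prompt, so the only way to answer is through the
    model's visual text-extraction pathway.
    """
    font = ImageFont.truetype(random.choice(FONT_PATHS),
                              size=random.randint(18, 28))
    vq_image = render_visualized_question(
        sample["image"], sample["question"],
        color=random.choice(TEXT_COLORS),
        position=random.choice(POSITIONS),
        font=font,
    )
    return {
        "image": vq_image,
        "prompt": "Answer the question written in the image.",
        "answer": sample["answer"],
    }
```

Because the style is re-sampled for every example, no fixed font or location can become a new shortcut, which is the structural constraint the method relies on.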
- Diagnosed 'modality laziness' in MLLMs: Qwen2.5-VL performance dropped by up to 12.7% when forced to read text from images.
- SimpleOCR's plug-and-play training uses 8.5K samples (30x fewer than RL-based methods) and outperforms the base model by 5.4% on out-of-distribution benchmarks.
- Technique renders text queries onto images with random styles, forcing models to use visual pathways without architectural changes.
Why It Matters
Enables more reliable AI that actually reads text in images for applications like document analysis, accessibility tools, and autonomous systems.