Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
A new training technique fixes AI's struggle to read text from images, raising math problem accuracy from 30% to over 92%.
A team of researchers from Johns Hopkins University and Meta has published a landmark study diagnosing a critical weakness in today's multimodal large language models (MLLMs). Models like GPT-4V, Claude 3, and Gemini, which can process both text and images, perform significantly worse when the text they need to reason about is presented as an image rather than as plain text tokens. The team systematically measured this 'modality gap' across seven MLLMs and seven benchmarks, finding that it is highly task- and data-dependent: on synthetic math problems, accuracy degraded by more than 60 percentage points, while on realistic document images from arXiv or Wikipedia, image-mode performance was often comparable to text mode.
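To make the evaluation setup concrete, here is a minimal sketch of how the same problem can be presented in both modes and scored. This is not the authors' actual harness; `query_model` and `check_answer` are hypothetical stand-ins for an MLLM call and an answer checker, and the rendering settings are illustrative only.

```python
# Minimal sketch of a text-vs-image evaluation loop (not the authors' harness).
# `query_model` and `check_answer` are hypothetical; font and canvas settings
# are illustrative defaults.
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, font_path: str = "DejaVuSans.ttf",
                         font_size: int = 20, width: int = 800) -> Image.Image:
    """Rasterize a problem statement onto a white canvas, one line at a time."""
    font = ImageFont.truetype(font_path, font_size)
    lines = text.split("\n")
    height = (font_size + 8) * len(lines) + 20
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * (font_size + 8)), line, fill="black", font=font)
    return img

def modality_gap(problems, query_model, check_answer):
    """Accuracy in text mode minus accuracy in image mode on the same items."""
    text_correct = image_correct = 0
    for p in problems:
        if check_answer(query_model(text=p["question"]), p["answer"]):
            text_correct += 1
        if check_answer(query_model(image=render_text_as_image(p["question"])), p["answer"]):
            image_correct += 1
    n = len(problems)
    return text_correct / n - image_correct / n
```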
Through a grounded-theory analysis of over 4,000 errors, the team discovered that visual input selectively amplifies 'reading' errors, such as misreading characters or failing to parse formatting, while leaving 'thinking' errors (knowledge and reasoning) unchanged. Some models even exhibited a 'chain-of-thought reasoning collapse,' where their step-by-step logic fell apart under visual input. The most striking finding was that seemingly minor rendering choices, such as font style, could swing accuracy by up to 47 percentage points, highlighting the fragility of current vision encoders.
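As one illustration of how such a rendering confound could be probed, the sketch below re-renders the same questions in several fonts and compares image-mode accuracy. The font list is illustrative, and it reuses the hypothetical `render_text_as_image`, `query_model`, and `check_answer` helpers from the previous sketch; the paper's actual confound analysis may be set up differently.

```python
# Hypothetical probe for rendering confounders: same questions, different fonts.
# A wide accuracy spread across fonts signals a fragile vision encoder.
FONT_PATHS = ["DejaVuSans.ttf", "DejaVuSerif.ttf", "DejaVuSansMono.ttf"]  # examples

def font_sensitivity(problems, query_model, check_answer):
    """Return image-mode accuracy per font for the same set of problems."""
    accuracies = {}
    for font in FONT_PATHS:
        correct = sum(
            check_answer(
                query_model(image=render_text_as_image(p["question"], font_path=font)),
                p["answer"],
            )
            for p in problems
        )
        accuracies[font] = correct / len(problems)
    return accuracies
```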
To bridge this gap, the researchers developed a novel 'self-distillation' training method. The technique involves having the model generate correct reasoning traces (chain-of-thought) using pure text input, then using those traces as training targets when the same problem is presented as an image. This teaches the vision encoder to extract textual information as reliably as the text tokenizer. The results were dramatic: on the GSM8K math benchmark, image-mode accuracy for their model skyrocketed from 30.71% to 92.72%, effectively closing the modality gap. Crucially, this improvement transferred to other, unseen benchmarks without causing catastrophic forgetting of other skills.
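A minimal sketch of how that self-distillation data could be assembled is shown below, assuming hypothetical helpers `generate_cot` (text-mode inference) and `extract_answer`, plus the `render_text_as_image` helper from the first sketch; the paper's exact pipeline, filtering, and sampling settings may differ.

```python
# Sketch of self-distillation data construction: the model's own correct
# text-mode reasoning traces become supervision targets for image-mode input.
def build_self_distillation_set(problems, generate_cot, extract_answer,
                                render_text_as_image, max_samples: int = 4):
    """Pair image-rendered questions with the model's own correct text-mode traces."""
    training_examples = []
    for p in problems:
        for _ in range(max_samples):
            # 1. Let the model reason over the plain-text question.
            trace = generate_cot(p["question"])
            # 2. Keep only traces that reach the known gold answer.
            if extract_answer(trace) != p["answer"]:
                continue
            # 3. Re-present the same question as pixels; the correct text-mode
            #    trace is the training target for image-mode input.
            training_examples.append({
                "image": render_text_as_image(p["question"]),
                "prompt": "Solve the problem shown in the image.",
                "target": trace,
            })
            break  # one correct trace per problem is enough for this sketch
    return training_examples
```

Note that no human-written rationale is needed: the gold answer only filters the model's own traces, which is what makes the approach scalable.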
This work provides both a detailed map of a major limitation in current multimodal AI and a practical, scalable solution. The self-distillation method requires no additional human annotation and leverages the model's own capabilities, pointing toward a future where AI can truly understand documents in whatever format they appear.
- Diagnosed a 'modality gap' where MLLMs perform worse on text-in-images vs. pure text, with math accuracy dropping by over 60 points on synthetic data.
- Found that rendering choices such as font are major confounders, swinging accuracy by up to 47 percentage points, and that visual input triggers 'reasoning collapse' in some models.
- Proposed a 'self-distillation' method that trains models on their own text-based reasoning traces paired with rendered images, boosting GSM8K image-mode accuracy from 30.71% to 92.72%.
Why It Matters
This directly improves AI's ability to analyze real-world documents like PDFs, screenshots, and scanned forms, unlocking more reliable enterprise and research applications.