Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
A new 'visual chain-of-thought' method improves text grounding by first locating where text is, then reading what it says.
A research team led by Longwei Xu has introduced Q-Mask, a novel framework designed to solve a core weakness in AI's ability to understand text within images. While modern Vision-Language Models (VLMs) can perform Optical Character Recognition (OCR), they often fail at 'text anchoring': precisely linking a piece of text to its exact location in an image. To diagnose this, the team first created TextAnchor-Bench (TABench), a benchmark that confirmed that both general-purpose and OCR-specific VLMs struggle with reliable spatial grounding.
The solution is Q-Mask's Causal Query-driven Mask Decoder (CQMD). Inspired by chain-of-thought reasoning in language models, this decoder performs 'visual CoT.' Instead of reading text directly, it first generates a sequence of query-conditioned visual masks to pinpoint *where* the text is located, before determining *what* the text says. This forces the model to gather grounded visual evidence prior to recognition, explicitly building text anchors during inference.
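To make the 'where before what' ordering concrete, here is a minimal PyTorch sketch of a query-conditioned locate-then-read step. It is an illustration under stated assumptions, not the paper's actual CQMD: the module names, tensor shapes, and the simplification to a single soft mask (rather than the sequence of masks the decoder generates) are all assumptions made for this example.

```python
# Hypothetical sketch of a query-driven "locate-then-read" step.
# All names, sizes, and the single-mask simplification are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class QueryDrivenMaskDecoder(nn.Module):
    """First predict WHERE the queried text is (a soft spatial mask),
    then read WHAT it says from mask-weighted visual features."""

    def __init__(self, dim: int = 256, vocab_size: int = 32000):
        super().__init__()
        # Cross-attention lets each image patch attend to the text query.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(dim, 1)           # per-patch mask logit ("where")
        self.text_head = nn.Linear(dim, vocab_size)  # token logits ("what")

    def forward(self, query_emb, patch_feats):
        # query_emb:   (B, Q, D) embedded query tokens, e.g. "find the price tag"
        # patch_feats: (B, P, D) visual features for P image patches
        # Stage 1 (where): condition every patch on the query and score it.
        conditioned, _ = self.cross_attn(patch_feats, query_emb, query_emb)
        mask = torch.sigmoid(self.mask_head(conditioned).squeeze(-1))   # (B, P)
        # Stage 2 (what): read text only from mask-weighted visual evidence.
        grounded = (mask.unsqueeze(-1) * patch_feats).sum(dim=1)        # (B, D)
        token_logits = self.text_head(grounded)                         # (B, V)
        return mask, token_logits


# Toy usage: one 8-token query over a 14x14 grid of image patches.
decoder = QueryDrivenMaskDecoder()
mask, logits = decoder(torch.randn(1, 8, 256), torch.randn(1, 196, 256))
```

The key design point the sketch tries to capture is the ordering: recognition is computed only from features that the predicted mask has already weighted, so the model cannot answer without first committing to a spatial anchor.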
To train this system, the researchers constructed TextAnchor-26M, a massive dataset of 26 million image-text pairs annotated with fine-grained masks that correspond to specific textual elements. This injects strong spatial priors into VLM training. Extensive experiments show that Q-Mask substantially improves both text anchoring accuracy and overall text understanding across diverse visual scenes, marking a step toward more reliable and interpretable document AI and visual question-answering systems.
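For intuition, a single TextAnchor-26M-style training record would pair an image with a text string and the mask of the region containing that text. The field names and polygon format below are assumptions made for illustration; the dataset's actual schema is not described in this summary.

```python
# Hypothetical shape of one image-text pair with a fine-grained text mask.
# Field names and the pixel-coordinate polygon format are illustrative only.
record = {
    "image": "receipt_00042.jpg",
    "text": "TOTAL $12.80",
    "mask_polygon": [[312, 540], [498, 540], [498, 572], [312, 572]],
}
```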
- Introduces a 'visual chain-of-thought' method where the AI first locates text regions (the 'where') before recognizing the characters (the 'what').
- Trained on a new, large-scale dataset called TextAnchor-26M, containing 26 million image-text pairs with precise mask annotations.
- Addresses a key flaw identified by the team's new benchmark, TextAnchor-Bench, which showed current VLMs are poor at spatial text grounding.
Why It Matters
Enables more reliable AI for document analysis, visual QA, and any application requiring precise understanding of text in context.