MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
New 'Mask-and-Select' method cuts irrelevant data, improving answers on complex visual questions.
Researchers led by Xianwei Mao developed MaS-VQA, a framework for Knowledge-Based Visual Question Answering. Its 'Mask-and-Select' mechanism prunes irrelevant image regions and noisy text from retrieved knowledge bases, producing a compact, high-signal input that better guides a multimodal large language model's reasoning. The system showed consistent gains on benchmarks such as Encyclopedic-VQA, demonstrating that it improves how AI combines external facts with visual data.
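The text-side pruning described above can be illustrated with a minimal sketch. Note the assumptions: the paper uses a learned relevance model inside a multimodal LLM pipeline, whereas this toy version scores retrieved snippets by simple token overlap with the question; the function name `mask_and_select` and all inputs are hypothetical, chosen only to show the mask-then-select flow.

```python
def mask_and_select(question, snippets, top_k=2):
    """Toy stand-in for MaS-VQA's text pruning: score each retrieved
    snippet against the question, mask zero-signal snippets, and keep
    the top_k as compact context for the answering model.

    Token overlap is an illustrative scorer only; the actual framework
    uses learned relevance within a multimodal LLM.
    """
    q_tokens = set(question.lower().split())
    scored = []
    for snippet in snippets:
        # Relevance proxy: how many question tokens appear in the snippet.
        overlap = len(q_tokens & set(snippet.lower().split()))
        scored.append((overlap, snippet))
    # Mask: drop snippets with no overlap at all. Select: keep the best top_k.
    kept = [s for score, s in sorted(scored, key=lambda t: -t[0]) if score > 0]
    return kept[:top_k]


snippets = [
    "The Eiffel Tower was completed in 1889.",
    "Bananas are rich in potassium.",
    "Gustave Eiffel designed the tower.",
]
# The off-topic banana snippet is masked out; the two relevant ones survive.
print(mask_and_select("Who designed the Eiffel Tower", snippets))
```

A learned version would replace the overlap score with a relevance head over joint image-question-snippet features, but the pipeline shape (score, mask, select, then prompt the model) is the same.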
Why It Matters
It makes AI vision systems more reliable for real-world tasks requiring deep, factual understanding of images.