MaS-VQA framework boosts AI vision accuracy by filtering noisy knowledge
New 'Mask-and-Select' method cuts irrelevant data, improving answers on complex visual questions.
Researchers led by Xianwei Mao developed MaS-VQA, a framework for Knowledge-Based Visual Question Answering. It uses a 'Mask-and-Select' mechanism to mask irrelevant image regions and filter noisy text out of the knowledge retrieved from external knowledge bases. The result is a compact, high-signal input that better guides a multimodal large language model's internal reasoning. The system showed consistent performance gains on benchmarks such as Encyclopedic-VQA, suggesting that it improves how AI combines external facts with visual data.
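The general idea of masking and selecting can be illustrated with a toy sketch. This is not the MaS-VQA implementation: the function names, the token-overlap relevance score, and the thresholds below are all assumptions chosen for illustration, and a real system would presumably use learned relevance scores rather than word overlap.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(question, passage):
    """Hypothetical score: fraction of question tokens that appear in the passage."""
    q_tokens = tokenize(question)
    if not q_tokens:
        return 0.0
    return len(q_tokens & tokenize(passage)) / len(q_tokens)


def mask_regions(region_scores, threshold=0.5):
    """Return a keep-mask: True for image regions scoring at or above the threshold.

    The per-region scores are assumed to come from some upstream model.
    """
    return [score >= threshold for score in region_scores]


def select_passages(question, passages, top_k=2, min_score=0.2):
    """Keep at most top_k passages whose relevance is at least min_score."""
    scored = [(relevance(question, p), i, p) for i, p in enumerate(passages)]
    kept = [item for item in scored if item[0] >= min_score]
    # Highest score first; ties keep the original retrieval order.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, _, p in kept[:top_k]]


question = "When was the Eiffel Tower built?"
passages = [
    "The Eiffel Tower was built in 1889 for the World's Fair.",
    "Paris is known for its cafes and museums.",
    "The tower is repainted every seven years.",
]
selected = select_passages(question, passages)
region_mask = mask_regions([0.9, 0.1, 0.5])
```

In this sketch the passage about cafes shares no words with the question and is dropped, while the low-scoring image region is masked out. Only the surviving passages and regions would be passed on to the language model.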
Why It Matters
It makes AI vision systems more reliable for real-world tasks requiring deep, factual understanding of images.