Research & Papers

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

A new agentic AI system reduces hallucinations by 40% through learning when to search for external knowledge and when to trust what it already knows.

Deep Dive

A research team led by Jielong Tang has introduced SAKE (Self-aware Knowledge Exploitation-Exploration), a novel framework that significantly advances Grounded Multimodal Named Entity Recognition (GMNER). Unlike existing approaches that either rely on noisy external knowledge retrieval or are limited by the internal knowledge boundaries of Multimodal Large Language Models (MLLMs), SAKE creates an agentic system that intelligently decides when to search for external information versus when to rely on its internal knowledge. This addresses the critical challenge of hallucinations in AI systems while improving accuracy on long-tailed and rapidly evolving entities commonly found on social media platforms.

The framework implements a sophisticated two-stage training process. First, it uses Difficulty-aware Search Tag Generation to quantify the model's entity-level uncertainty through multiple forward samplings, creating explicit knowledge-gap signals. These signals help build SAKE-SeCoT, a high-quality Chain-of-Thought dataset that provides the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, the researchers employ agentic reinforcement learning with a hybrid reward function that specifically penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making.
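The core of the first stage, estimating entity-level uncertainty from repeated samplings, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the agreement-based uncertainty measure, the threshold value, and the tag strings are all assumptions for the sake of the example.

```python
from collections import Counter

def search_tag(samples, threshold=0.6):
    """Assign an explicit search tag from entity-level uncertainty.

    `samples` holds the entity predictions from several stochastic
    forward passes of the model (hypothetical interface; the paper's
    exact sampling and scoring setup may differ). If the most frequent
    prediction falls below `threshold` agreement, we treat it as a
    knowledge gap and emit a <search> tag; otherwise <no_search>.
    """
    top_count = Counter(samples).most_common(1)[0][1]
    agreement = top_count / len(samples)
    return "<search>" if agreement < threshold else "<no_search>"

# A confident entity: all samples agree, so rely on internal knowledge.
print(search_tag(["Eiffel Tower"] * 5))                      # <no_search>
# A long-tailed entity: samples disagree, so flag it for retrieval.
print(search_tag(["BTS", "Blackpink", "NewJeans", "BTS", "IVE"]))  # <search>
```

Tags like these, attached per entity, are the kind of explicit knowledge-gap signal the SAKE-SeCoT dataset can then teach the model to produce during supervised fine-tuning.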

Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness in balancing precision and recall. The system shows particular strength in handling the challenging open-world scenarios where entities are constantly evolving and often unseen during training. By reducing both hallucination rates and unnecessary external searches, SAKE represents a significant step toward more efficient and reliable multimodal AI systems that can better understand the complex relationship between text and images in real-world applications.
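The trade-off between hallucination and wasted retrieval comes from the second-stage reward. A hedged sketch of such a hybrid reward is below; the actual terms, weights, and penalty value in the paper are not specified here, and this only illustrates the key idea of docking reward for retrieval when internal knowledge would have sufficed.

```python
def hybrid_reward(correct, searched, search_was_needed,
                  search_penalty=0.5):
    """Illustrative hybrid RL reward in the spirit of SAKE.

    Rewards task correctness, then subtracts a penalty when the
    policy searched even though its internal knowledge was enough
    (all names and the 0.5 penalty are assumptions for this sketch).
    """
    reward = 1.0 if correct else 0.0
    if searched and not search_was_needed:
        reward -= search_penalty  # discourage unnecessary retrieval
    return reward

# A correct answer without a wasted search earns the full reward...
print(hybrid_reward(correct=True, searched=False, search_was_needed=False))  # 1.0
# ...while a correct answer that searched needlessly is docked.
print(hybrid_reward(correct=True, searched=True, search_was_needed=False))   # 0.5
```

Optimizing a signal shaped like this is what pushes the policy from rigid search imitation toward genuinely selective retrieval.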

Key Points
  • SAKE reduces hallucinations by 40% through intelligent knowledge search decisions
  • Uses two-stage training with Difficulty-aware Search Tag Generation and agentic RL
  • Achieves state-of-the-art performance on social media GMNER benchmarks

Why It Matters

Enables more accurate AI understanding of social media content, reducing misinformation and improving content moderation.