Learning to Select Visual In-Context Demonstrations
A new AI agent learns to select optimal examples, boosting MLLM performance on factual regression tasks by 15%.
A team of researchers has developed a novel method called Learning to Select Demonstrations (LSD) that fundamentally improves how Multimodal Large Language Models (MLLMs) learn from examples. The standard approach, unsupervised k-Nearest Neighbor (kNN) search, selects examples based purely on visual similarity, which often leads to redundant, unhelpful demonstrations for complex tasks like factual regression. The researchers reframed the problem as one of sequential decision-making and trained a Reinforcement Learning agent—specifically a Dueling Deep Q-Network (DQN) with a query-centric Transformer Decoder—to learn a policy for constructing optimal demonstration sets that maximize the MLLM's final performance.
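To make the sequential-decision framing concrete, here is a minimal sketch of how a Dueling DQN could greedily assemble a demonstration set for a query: the state summarizes the query plus the demonstrations chosen so far, each action adds one candidate, and the learned Q-values stand in for the MLLM-performance reward the agent was trained against. The class and function names, the pooled-state MLP (in place of the paper's query-centric Transformer Decoder), and the embedding dimensions are illustrative assumptions, not the authors' released code.

```python
# Sketch: demonstration selection as sequential decision-making with a Dueling DQN.
# Assumption: query/candidate embeddings are precomputed vectors of the same size.
import torch
import torch.nn as nn


class DuelingQNet(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # State = query embedding concatenated with a pooled summary of the
        # demonstrations selected so far.
        self.value = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Advantage scores each remaining candidate given the current state.
        self.advantage = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # state: (2*dim,), candidates: (num_candidates, dim)
        v = self.value(state)                                        # (1,)
        state_rep = state.unsqueeze(0).expand(len(candidates), -1)   # (num_candidates, 2*dim)
        a = self.advantage(torch.cat([state_rep, candidates], dim=-1)).squeeze(-1)
        return v + a - a.mean()                                      # Q-value per candidate


def select_demonstrations(qnet, query_emb, pool_embs, k=4):
    """Greedily roll out the learned policy to build a k-shot demonstration set."""
    selected, remaining = [], list(range(len(pool_embs)))
    for _ in range(k):
        # Summarize the partial set (zeros before anything is selected).
        summary = pool_embs[selected].mean(0) if selected else torch.zeros_like(query_emb)
        state = torch.cat([query_emb, summary])
        q_values = qnet(state, pool_embs[remaining])
        best = remaining[int(q_values.argmax())]
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate pool, prepended to the MLLM prompt


if __name__ == "__main__":
    dim = 512
    qnet = DuelingQNet(dim)
    query = torch.randn(dim)
    pool = torch.randn(100, dim)  # cached embeddings of candidate demonstrations
    print(select_demonstrations(qnet, query, pool, k=4))
```

Because the pooled summary changes after every pick, the Q-values for later steps can favor candidates that add new information rather than near-duplicates of what is already in the set, which is the behavior the article attributes to LSD.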
Evaluated across five visual regression benchmarks, LSD revealed a crucial dichotomy: while simple kNN remains sufficient for subjective preference tasks, the learned agent significantly outperforms all baselines on objective, factual regression tasks. This is because LSD intelligently balances visual relevance with output diversity, selecting a set of examples that better defines the true boundaries of the regression problem. The work, accepted to CVPR 2026, provides a clear framework for when advanced, learned selection is necessary, moving beyond one-size-fits-all strategies for visual in-context learning.
- Replaces standard k-Nearest Neighbor search with a Reinforcement Learning agent (Dueling DQN) to select examples (a sketch of the kNN baseline follows this list).
- Uncovered a key dichotomy: kNN works for subjective tasks, but LSD is superior for objective, factual visual regression.
- Tested on five benchmarks, showing LSD selects more diverse and relevant examples to better define task boundaries.
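For contrast with the learned policy above, this is a minimal sketch of the standard kNN baseline the bullets refer to: rank candidates purely by visual similarity to the query and keep the top k. The function and variable names are illustrative assumptions, not code from the paper.

```python
# Sketch: the unsupervised kNN baseline - pick the k most visually similar candidates.
import torch
import torch.nn.functional as F


def knn_select(query_emb: torch.Tensor, pool_embs: torch.Tensor, k: int = 4) -> list[int]:
    """Return indices of the k candidates most cosine-similar to the query embedding."""
    sims = F.cosine_similarity(pool_embs, query_emb.unsqueeze(0), dim=-1)
    return sims.topk(k).indices.tolist()
```

Because similarity is the only criterion, the retrieved set can consist of near-duplicates; that redundancy is exactly what the learned LSD policy is trained to avoid on factual regression tasks.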
Why It Matters
Enables more accurate and reliable visual AI for critical applications like medical imaging, scientific analysis, and autonomous systems.