PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
New system achieves 60% accuracy on tricky idiom tasks by adding lightweight modules to frozen CLIP.
A new research paper introduces PolyFrame, a system that dramatically improves how AI understands idioms across images and text. Presented at the MWE-2026 AdMIRe 2 shared task, the work addresses a critical weakness in multimodal models: their struggle with non-literal expressions such as 'kick the bucket' or 'spill the beans.'
The technical approach is notably efficient. Instead of fine-tuning massive vision-language models like CLIP, which would be computationally expensive, PolyFrame keeps the encoders frozen and adds only lightweight modules: a logistic regression classifier, an LLM-based sentence-type predictor, idiom synonym substitution, and specialized scoring mechanisms. Starting from a CLIP baseline of just 26.7% Top-1 accuracy on the English development data, PolyFrame boosted performance to 60.0% Top-1. Crucially, it also demonstrated strong zero-shot transfer, reaching 60.0% Top-1 accuracy (0.822 NDCG@5) on Portuguese data without any Portuguese-specific training.
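The frozen-encoder idea can be illustrated with a toy sketch. The code below is not the authors' implementation: it uses random vectors to stand in for frozen CLIP embeddings, and a hand-set logistic head with two hypothetical similarity features (candidate-image similarity to the raw sentence and to an idiom-rewritten paraphrase) to show how lightweight add-ons can rank candidates without touching the encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_images(sent_emb, para_emb, image_embs, w, b):
    """Rank candidate images with a tiny logistic head on top of
    frozen-encoder similarity features (illustrative, not the paper's exact head)."""
    scores = []
    for img in image_embs:
        # Two features per candidate: similarity to the raw sentence
        # embedding and to the idiom-rewritten paraphrase embedding.
        f = np.array([cosine(sent_emb, img), cosine(para_emb, img)])
        scores.append(float(1 / (1 + np.exp(-(w @ f + b)))))
    # Candidate indices, best first.
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy demo: random "embeddings" stand in for frozen CLIP outputs.
dim = 8
sent = rng.normal(size=dim)
para = sent + 0.1 * rng.normal(size=dim)       # paraphrase stays close to the sentence
images = [para + 0.05 * rng.normal(size=dim),  # near-duplicate: the "correct" image
          rng.normal(size=dim),                # two unrelated distractors
          rng.normal(size=dim)]
w, b = np.array([0.5, 1.5]), 0.0               # paraphrase feature weighted higher
print(rank_images(sent, para, images, w, b))   # correct image (index 0) ranks first
```

In a real pipeline the stub embeddings would come from the frozen CLIP image and text encoders, and the head's weights would be fit on development data rather than set by hand.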
On the final multilingual blind test spanning 15 languages, PolyFrame achieved average Top-1/NDCG scores of 0.35/0.73 on the image+text ranking task and 0.32/0.71 on the text-only caption ranking task. Ablation studies identified idiom-aware rewriting as the most significant performance contributor, while sentence-type prediction and multimodal fusion added robustness. This research offers a blueprint for improving AI's handling of cultural and linguistic nuance without the prohibitive cost of retraining foundation models, paving the way for more context-aware assistants and translation tools.
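For readers unfamiliar with the NDCG@5 figures quoted above, the metric can be computed in a few lines. This is a generic sketch assuming binary relevance (one correct image per instance), which may differ from the shared task's exact grading.

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal (relevance-sorted) ranking. Here rel is 1 for the correct
    image and 0 otherwise."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Correct image ranked second of five: NDCG@5 = 1/log2(3) ≈ 0.631
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 3))
```

Because the discount falls off logarithmically, NDCG@5 rewards placing the correct image near the top even when it misses rank 1, which is why the 0.73 NDCG figure can sit well above the 0.35 Top-1 score.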
- PolyFrame improved Top-1 accuracy on English idiom disambiguation from 26.7% to 60.0% using lightweight add-ons to frozen CLIP/BGE M3 models.
- The system achieved strong zero-shot transfer, scoring 60.0% Top-1 accuracy on Portuguese data without specific training for that language.
- Ablation studies show idiom-aware rewriting was the key driver, demonstrating that major gains are possible without fine-tuning large multimodal encoders.
Why It Matters
Enables AI to grasp cultural nuance and metaphor without costly retraining, improving translation, content moderation, and assistive tools.