PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
New system achieves 60% accuracy on tricky idiom tasks by adding lightweight modules to frozen CLIP.
A new research paper introduces PolyFrame, a system that dramatically improves how AI understands idioms across images and text. Presented at the MWE-2026 AdMIRe 2 shared task, the work addresses a critical weakness in multimodal models: their struggle with non-literal expressions such as 'kick the bucket' or 'spill the beans.'
The technical approach is notably efficient. Instead of fine-tuning massive vision-language models like CLIP, which would be computationally expensive, PolyFrame keeps the encoders frozen and adds only lightweight modules: a logistic regression classifier, an LLM-based sentence-type predictor, idiom synonym substitution, and specialized scoring mechanisms. Starting from a CLIP baseline of just 26.7% Top-1 accuracy on the English development data, PolyFrame boosted performance to 60.0% Top-1. Crucially, it also demonstrated strong zero-shot transfer, reaching 60.0% Top-1 accuracy (0.822 NDCG@5) on Portuguese data without any Portuguese-specific training.
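The frozen-encoder idea can be illustrated with a toy sketch. The code below is not the authors' implementation: it uses random vectors to stand in for frozen CLIP embeddings, and a hand-set logistic head with two hypothetical similarity features (candidate-image similarity to the raw sentence and to an idiom-rewritten paraphrase) to show how lightweight add-ons can rank candidates without touching the encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_images(sent_emb, para_emb, image_embs, w, b):
    """Rank candidate images with a tiny logistic head on top of
    frozen-encoder similarity features (illustrative, not the paper's exact head)."""
    scores = []
    for img in image_embs:
        # Two features per candidate: similarity to the raw sentence
        # embedding and to the idiom-rewritten paraphrase embedding.
        f = np.array([cosine(sent_emb, img), cosine(para_emb, img)])
        scores.append(float(1 / (1 + np.exp(-(w @ f + b)))))
    # Candidate indices, best first.
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy demo: random "embeddings" stand in for frozen CLIP outputs.
dim = 8
sent = rng.normal(size=dim)
para = sent + 0.1 * rng.normal(size=dim)       # paraphrase stays close to the sentence
images = [para + 0.05 * rng.normal(size=dim),  # near-duplicate: the "correct" image
          rng.normal(size=dim),                # two unrelated distractors
          rng.normal(size=dim)]
w, b = np.array([0.5, 1.5]), 0.0               # paraphrase feature weighted higher
print(rank_images(sent, para, images, w, b))   # correct image (index 0) ranks first
```

In a real pipeline the stub embeddings would come from the frozen CLIP image and text encoders, and the head's weights would be fit on development data rather than set by hand.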
On the final multilingual blind test spanning 15 languages, PolyFrame achieved average Top-1/NDCG scores of 0.35/0.73 on the image+text ranking task and 0.32/0.71 on the text-only caption ranking task. Ablation studies identified idiom-aware rewriting as the most significant performance contributor, while sentence-type prediction and multimodal fusion added robustness. This research offers a blueprint for improving AI's handling of cultural and linguistic nuance without the prohibitive cost of retraining foundation models, paving the way for more context-aware assistants and translation tools.
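For readers unfamiliar with the NDCG@5 figures quoted above, the metric can be computed in a few lines. This is a generic sketch assuming binary relevance (one correct image per instance), which may differ from the shared task's exact grading.

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal (relevance-sorted) ranking. Here rel is 1 for the correct
    image and 0 otherwise."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Correct image ranked second of five: NDCG@5 = 1/log2(3) ≈ 0.631
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 3))
```

Because the discount falls off logarithmically, NDCG@5 rewards placing the correct image near the top even when it misses rank 1, which is why the 0.73 NDCG figure can sit well above the 0.35 Top-1 score.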
- PolyFrame improved Top-1 accuracy on English idiom disambiguation from 26.7% to 60.0% using lightweight add-ons to frozen CLIP/BGE M3 models.
- The system achieved strong zero-shot transfer, scoring 60.0% Top-1 accuracy on Portuguese data without specific training for that language.
- Ablation studies show idiom-aware rewriting was the key driver, demonstrating that major gains are possible without fine-tuning large multimodal encoders.
Why It Matters
Enables AI to grasp cultural nuance and metaphor without costly retraining, improving translation, content moderation, and assistive tools.