COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
Researchers combine multiple LLMs and prompting strategies to match human judgment in semantic evaluation.
Researchers Azwad Anjum Islam and Tisa Islam Erana developed the COGNAC system for SemEval-2026 Task 5, a competition focused on rating the plausibility of word senses in challenging narratives. The task required systems to rate homonyms in short stories on a 5-point Likert scale, with performance measured by accuracy within one standard deviation of human judgments and by Spearman rank correlation. The team's approach ensembled multiple closed-source commercial LLMs (the abstract does not name specific models) and tested three distinct prompting strategies: a baseline zero-shot setup, Chain-of-Thought (CoT) prompting with structured reasoning, and a novel comparative prompting method that evaluates candidate word senses simultaneously.
Their official competition system, which combined LLM ensembles across all three prompting strategies, placed 4th on the leaderboard with 0.88 accuracy and 0.83 Spearman's rho, for a 0.86 average score. Post-competition experiments with additional models pushed performance to 0.92 accuracy and 0.85 Spearman's rho, for a 0.89 average. The research found that comparative prompting consistently improved performance across model families, and that model ensembling significantly enhanced alignment with mean human judgments. This suggests LLM ensembles are particularly well-suited to subjective semantic evaluation tasks where human annotators may disagree, effectively capturing the variation inherent in human language understanding.
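The core ensembling idea can be sketched as a simple average of per-model, per-strategy Likert ratings. This is a minimal illustration only, not the authors' exact aggregation method; the model names, strategy labels, and scores below are hypothetical.

```python
from statistics import mean

def ensemble_rating(ratings):
    """Average 1-5 Likert plausibility ratings from several runs.

    `ratings` maps a (model, strategy) pair to that run's score;
    the ensemble score is the plain mean, staying on the 1-5 scale.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings.values())

# Hypothetical ratings for one homonym in context, from two models
# under the three prompting strategies described above.
ratings = {
    ("model_a", "zero_shot"): 4,
    ("model_a", "cot"): 5,
    ("model_a", "comparative"): 4,
    ("model_b", "zero_shot"): 3,
    ("model_b", "cot"): 4,
    ("model_b", "comparative"): 4,
}
score = ensemble_rating(ratings)  # 4.0
```

Averaging over independent runs is one plausible way an ensemble could approximate the mean of several human annotators, which fits the task's evaluation against mean human judgments.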
- COGNAC system placed 4th in SemEval-2026 Task 5 with 0.88 accuracy and 0.83 Spearman correlation
- Post-competition improvements reached 0.92 accuracy and 0.85 correlation using additional LLMs
- Comparative prompting and model ensembling proved most effective for handling subjective semantic tasks
Why It Matters
Demonstrates LLMs can achieve human-level performance on nuanced language understanding tasks, advancing semantic AI applications.