New IntentGrasp benchmark exposes LLMs' failures in intent understanding
Top models score below 60% F1, and 17 of 20 fall below the random-guess baseline on the harder Gem Set.
A new research paper from Yuwei Yin, Chuyuan Li, and Giuseppe Carenini at the University of British Columbia introduces IntentGrasp, a comprehensive benchmark designed to measure how well large language models (LLMs) understand user intent. Built from 49 high-quality open-licensed corpora spanning 12 diverse domains (e.g., customer service, task planning, dialogue), the benchmark includes a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced, challenging Gem Set of 470 cases. The authors evaluated 20 LLMs across 7 model families, including frontier models like GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7, and found uniformly poor performance: all models scored below 60% F1 on the All Set and below 25% on the Gem Set. Strikingly, 17 out of 20 tested models performed worse than a random-guess baseline of 15.2% on the Gem Set, while estimated human performance stands at ~81.1%.
To address this gap, the authors propose Intentional Fine-Tuning (IFT), a method that fine-tunes models on the IntentGrasp training set using a unified task format. IFT yields dramatic improvements: over 30 F1 points on the All Set and over 20 points on the Gem Set, with the best fine-tuned models approaching or exceeding human-level performance on the All Set. Leave-one-domain-out (Lodo) experiments show strong cross-domain generalizability, meaning that the improvements aren't just memorization of training domains. The paper argues that improving intent understanding is a critical step toward more intentional, capable, and safe AI assistants. The dataset and code are released on Hugging Face and GitHub.
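A leave-one-domain-out protocol can be sketched as follows. This is a minimal illustration of the general idea, not the authors' code: the `domain`, `text`, and `intent` field names, the toy instances, and the `leave_one_domain_out` helper are all assumptions made for the example.

```python
from collections import defaultdict


def leave_one_domain_out(instances):
    """Yield (held_out_domain, train, test) splits, one per domain.

    Each instance is assumed to be a dict with a hypothetical "domain" key.
    """
    by_domain = defaultdict(list)
    for inst in instances:
        by_domain[inst["domain"]].append(inst)

    for held_out in sorted(by_domain):
        # Train on every domain except the held-out one.
        train = [
            inst
            for domain, items in by_domain.items()
            if domain != held_out
            for inst in items
        ]
        # Evaluate only on the domain the model never saw in training.
        test = list(by_domain[held_out])
        yield held_out, train, test


# Toy instances (invented for illustration, not taken from IntentGrasp).
toy = [
    {"domain": "customer_service", "text": "Where is my refund?", "intent": "refund_status"},
    {"domain": "customer_service", "text": "Cancel my order.", "intent": "cancel_order"},
    {"domain": "task_planning", "text": "Book a table for two.", "intent": "make_reservation"},
    {"domain": "dialogue", "text": "Can you say that again?", "intent": "request_repeat"},
]

for held_out, train, test in leave_one_domain_out(toy):
    print(f"held out: {held_out}, train: {len(train)}, test: {len(test)}")
```

If a model fine-tuned on each `train` split still improves on the matching `test` split, the gain cannot be explained by having seen that domain during training, which is the kind of evidence the cross-domain claim rests on.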
- IntentGrasp includes 262,759 training and 12,909 test cases across 12 domains, plus a 470-case hard set (Gem Set).
- 20 LLMs, including GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7, scored below 60% F1 on the All Set and below 25% on the Gem Set; 17 of 20 did worse than random guessing on the Gem Set.
- Intentional Fine-Tuning (IFT) boosts F1 by 30+ points (All Set) and 20+ points (Gem Set) with strong cross-domain generalization.
Why It Matters
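Current LLMs struggle with intent understanding, even at the frontier. IntentGrasp gives researchers a way to measure that gap, and Intentional Fine-Tuning offers one route to closing it and building more helpful AI assistants.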