New IntentGrasp benchmark exposes LLMs' intent understanding failure

arXiv cs.CL May 11, 2026

⚡ Top models score below 60% F1 on the All Set, and 17 of 20 do worse than random guessing on the harder Gem Set.

Deep Dive

A new research paper from Yuwei Yin, Chuyuan Li, and Giuseppe Carenini at the University of British Columbia introduces IntentGrasp, a comprehensive benchmark designed to measure how well large language models (LLMs) understand user intent. Built from 49 high-quality open-licensed corpora spanning 12 diverse domains (e.g., customer service, task planning, dialogue), the benchmark includes a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced, challenging Gem Set of 470 cases. The authors evaluated 20 LLMs across 7 model families, including frontier models like GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7, and found uniformly poor performance: all models scored below 60% F1 on the All Set and below 25% on the Gem Set. Strikingly, 17 out of 20 tested models performed worse than a random-guess baseline of 15.2% on the Gem Set, while estimated human performance stands at ~81.1%.
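The scores above are reported as F1. This summary does not spell out how the paper computes it, so the following is only a minimal sketch of one common choice, a set-based F1 between predicted and gold intent labels, averaged over instances. The function names `intent_f1` and `mean_f1` and the example labels are hypothetical and are not taken from the IntentGrasp code.

```python
def intent_f1(predicted, gold):
    """Set-based F1 between predicted and gold intent labels for one instance.

    Assumption: labels are compared as unordered sets; the paper's exact
    scoring protocol may differ.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, golds):
    """Average the per-instance F1 over a whole evaluation set."""
    scores = [intent_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only.
predictions = [["book_flight"], ["cancel_order", "refund"]]
golds = [["book_flight"], ["refund"]]
print(f"mean F1: {mean_f1(predictions, golds):.3f}")
```

Under a scheme like this, a random-guess baseline would be obtained by scoring randomly sampled labels the same way, which is the kind of reference point the 15.2% figure provides.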

To address this gap, the authors propose Intentional Fine-Tuning (IFT), a method that fine-tunes models on the IntentGrasp training set using a unified task format. IFT yields dramatic improvements: over 30 F1 points on the All Set and over 20 points on the Gem Set, with the best fine-tuned models approaching or exceeding human-level performance on the All Set. Leave-one-domain-out (LODO) experiments show strong cross-domain generalizability, meaning that the improvements aren't just memorization of training domains. The paper argues that improving intent understanding is a critical step toward more intentional, capable, and safe AI assistants. The dataset and code are released on Hugging Face and GitHub.
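A leave-one-domain-out setup can be sketched as follows: each domain is held out in turn, the model is trained on the remaining domains, and it is evaluated only on the held-out one. This is an illustrative sketch, not the authors' code; the `domain` field name, the helper `leave_one_domain_out`, and the toy examples are assumptions.

```python
from collections import defaultdict


def leave_one_domain_out(examples):
    """Yield (held_out_domain, train, test) for each domain in turn.

    Assumption: every example is a dict with a "domain" key.
    """
    by_domain = defaultdict(list)
    for example in examples:
        by_domain[example["domain"]].append(example)

    for held_out in sorted(by_domain):
        # Train on every domain except the held-out one.
        train = [
            example
            for domain, items in by_domain.items()
            if domain != held_out
            for example in items
        ]
        test = by_domain[held_out]
        yield held_out, train, test


# Toy data, for illustration only.
examples = [
    {"domain": "customer_service", "text": "Where is my package?"},
    {"domain": "task_planning", "text": "Plan a weekend trip."},
    {"domain": "dialogue", "text": "Can you repeat that?"},
    {"domain": "dialogue", "text": "Tell me more."},
]

for held_out, train, test in leave_one_domain_out(examples):
    print(held_out, len(train), len(test))
```

If a model fine-tuned on the `train` portion still improves on the unseen `test` domain, the gain cannot come from memorizing that domain, which is the argument the LODO experiments support.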

Key Points

IntentGrasp includes 262,759 training and 12,909 test cases across 12 domains, plus a 470-case hard set (Gem Set).
20 LLMs (including GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) scored below 60% F1 on the All Set and below 25% on the Gem Set; 17 of 20 did worse than random guessing on the Gem Set.
Intentional Fine-Tuning (IFT) boosts F1 by 30+ points (All Set) and 20+ points (Gem Set) with strong cross-domain generalization.

Why It Matters

Current LLMs struggle with basic intent understanding. This benchmark quantifies the gap, and the proposed fine-tuning method offers a way to narrow it, a step toward more helpful AI assistants.
