What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

arXiv cs.CL February 25, 2026

Research on roughly 370,000 queries shows that complex, vague prompts are associated with more AI errors, while clear ones carry lower risk.

Deep Dive

A team from Carnegie Mellon University, led by William Watson and Nicole Cho, published a study titled 'What Makes a Good Query?' that reframes AI hallucinations as partly a matter of query design rather than model defects alone. Analyzing 369,837 real-world prompts, the research finds that a query's linguistic form, not just the model's architecture, significantly shapes the likelihood of incorrect or fabricated responses. The work draws on classical linguistics to argue that features that confuse humans also confuse LLMs like GPT-4 and Claude, yielding a measurable 'risk landscape' for hallucinations.

The study operationalizes this idea by constructing a 22-dimension feature vector that captures properties such as clause nesting, lexical rarity, anaphora (references like 'it' or 'they'), negation, and intention grounding. The large-scale analysis revealed consistent patterns: queries with deep syntactic nesting and underspecified references were associated with higher hallucination propensity, while those with clear intention grounding and direct answerability showed lower error rates. The findings, accepted for EACL 2026, provide an empirical foundation for developing guided query-rewriting systems and pre-prompt interventions, potentially reducing AI errors before a model even generates a response.
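To make the idea of a query-level feature vector concrete, here is a minimal Python sketch that computes a few surface proxies for the kinds of features named above (anaphora, negation, clause nesting). It is an illustration only: the word lists, the function name `query_features`, and the four features are assumptions made for this example, not the paper's 22 dimensions or its extraction method.

```python
import re

# Hypothetical word lists chosen for illustration; not taken from the paper.
PRONOUNS = {"it", "they", "them", "this", "these", "those", "he", "she"}
NEGATIONS = {"not", "no", "never", "none", "neither", "nor", "without"}
SUBORDINATORS = {"that", "which", "who", "whose", "because", "although",
                 "if", "while", "when", "unless", "since"}


def tokenize(query):
    """Lowercase the query and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", query.lower())


def query_features(query):
    """Return a small dict of surface features for one query.

    Subordinator counts are only a rough stand-in for clause nesting;
    a real system would likely use a syntactic parser instead.
    """
    tokens = tokenize(query)
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        "length": len(tokens),
        "pronoun_rate": sum(t in PRONOUNS for t in tokens) / n,
        "negation_count": sum(
            t in NEGATIONS or t.endswith("n't") for t in tokens
        ),
        "subordinator_count": sum(t in SUBORDINATORS for t in tokens),
    }


if __name__ == "__main__":
    print(query_features("It isn't clear if they know which one it was."))
    print(query_features("What is the capital of France?"))
```

Under these assumptions, the first query scores higher on pronoun rate, negation, and subordination than the second, which is the direction of contrast the study describes between underspecified and directly answerable queries.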

Key Points

Analyzed 369,837 real-world queries to build a 22-dimension linguistic feature vector linked to hallucination risk.
Found that deep clause nesting and underspecified references are associated with more errors, while clear intention grounding and direct answerability are associated with fewer.
Establishes a framework for future tools to automatically rewrite confusing prompts before they reach models like GPT-4 or Llama.
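A pre-prompt intervention of the kind the key points describe could, in its simplest form, be a check that runs before a query is sent to a model. The Python sketch below flags queries with several possibly unresolved references and reports why. The function name `preflight`, the word list, and the threshold are all hypothetical placeholders for illustration; the paper does not specify such a tool.

```python
import re

# Hypothetical list of reference words that may lack a clear antecedent.
AMBIGUOUS_REFERENCES = {"it", "they", "them", "this", "these", "those"}


def preflight(query, max_ambiguous=1):
    """Return (ok, reasons) for a query before it is sent to a model.

    max_ambiguous is an arbitrary placeholder threshold, not a value
    derived from the study.
    """
    tokens = re.findall(r"[a-z']+", query.lower())
    reasons = []
    if not tokens:
        reasons.append("empty query")
    hits = [t for t in tokens if t in AMBIGUOUS_REFERENCES]
    if len(hits) > max_ambiguous:
        found = ", ".join(sorted(set(hits)))
        reasons.append(f"{len(hits)} possibly unresolved references: {found}")
    return (not reasons, reasons)


if __name__ == "__main__":
    for q in ("Does it work when they update this?",
              "What is the boiling point of water at sea level?"):
        ok, reasons = preflight(q)
        print(q, "->", "send" if ok else f"ask user to clarify ({reasons})")
```

A flagged query could then be handed to a rewriting step or returned to the user for clarification; which option works better is an open design choice that this sketch does not address.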

Why It Matters

Lays the groundwork for pre-processing tools that rewrite confusing user prompts before generation, which could improve AI reliability and trust in enterprise applications.
