Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
New benchmark shows even top models fail when requirements are vague.
A team of researchers from multiple institutions (including Di Yang, Xinou Xie, and others) has introduced Orchid, the first code generation benchmark specifically designed to test Large Language Models (LLMs) against ambiguous software requirements. Orchid comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Unlike existing benchmarks that assume clear, precise specifications, Orchid reflects real-world practice, where software requirements are often vague or conflicting.
In a systematic empirical study, the team evaluated several LLMs on Orchid and found that ambiguity consistently degrades performance across all models, with the most advanced LLMs suffering the most pronounced drops. Notably, LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and cannot identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the need for ambiguity-aware techniques in next-generation automated software engineering tools. The Orchid benchmark is publicly available.
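The failure mode is easy to picture with a small, hypothetical example (not an actual Orchid task): a vague requirement such as "return the largest values from a list of numbers" admits multiple defensible readings, so two models can both "satisfy" it while disagreeing on behavior.

```python
# Hypothetical illustration of requirement vagueness (not drawn from Orchid):
# "Write a function that returns the largest values from a list of numbers."
# The phrase "largest values" is underspecified, so different models may
# commit to different interpretations.

def largest_values_v1(nums: list[float]) -> list[float]:
    """Interpretation A: every value tied for the maximum."""
    if not nums:
        return []
    top = max(nums)
    return [n for n in nums if n == top]

def largest_values_v2(nums: list[float], k: int = 3) -> list[float]:
    """Interpretation B: the top-k values in descending order."""
    return sorted(nums, reverse=True)[:k]

# Both are plausible implementations of the same requirement,
# yet they diverge on the same input:
print(largest_values_v1([5, 1, 5, 3]))  # [5, 5]
print(largest_values_v2([5, 1, 5, 3]))  # [5, 5, 3]
```

A single hidden test suite will accept at most one of these interpretations, which is why divergent readings of an ambiguous requirement translate directly into benchmark failures.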
- Orchid includes 1,304 function-level tasks with four ambiguity types: lexical, syntactic, semantic, and vagueness.
- Ambiguity degraded performance across all tested LLMs, with the most advanced models suffering the largest drops.
- LLMs produced divergent implementations for the same ambiguous requirement and could not autonomously detect or resolve ambiguity.
Why It Matters
Requirement ambiguity exposes a fundamental blind spot in LLM code generation, threatening the reliability of these tools in real-world software engineering.