A framework for assessing the code generation capabilities of large language models for constraint domain-specific languages
A new evaluation method shows LLM code generation quality drops 50% for constraint languages versus Python.
A team of researchers has introduced a novel framework for systematically evaluating how well large language models (LLMs) can generate code for Domain-Specific Languages (DSLs), which are specialized programming languages for particular tasks. The study, led by David Delgado, Lola Burgueño, and Robert Clarisó, focuses on a critical gap: while LLMs such as GPT-4 and Claude excel at popular languages like Python, their performance plummets for less common DSLs, particularly constraint languages like OCL (Object Constraint Language) and Alloy. The framework assesses generated code on two core dimensions: syntactic well-formedness and functional correctness, providing a standardized way to benchmark models on these challenging tasks.
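The paper's own harness is not reproduced here, so the snippet below is only a minimal, hypothetical Python sketch of how such a two-dimension evaluation could be wired up: a syntax check plus a functional check, aggregated over a benchmark. The `Task` and `evaluate` names, the use of `ast.parse` for the Python baseline, and the pluggable checker interface are illustrative assumptions; a constraint DSL such as OCL or Alloy would need its own parser and checking tool behind the same interface.

```python
# Hypothetical sketch of a two-dimension evaluation: syntactic well-formedness
# plus functional correctness. Names and structure are illustrative, not the
# authors' actual framework.
import ast
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Task:
    prompt: str                              # natural-language specification given to the LLM
    generated_code: str                      # candidate code returned by the model
    reference_check: Callable[[str], bool]   # functional check (e.g. unit tests or a solver run)


def python_syntax_ok(code: str) -> bool:
    """Well-formedness check for the Python baseline; a DSL would plug in its own parser here."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def evaluate(tasks: Iterable[Task], syntax_ok: Callable[[str], bool]) -> dict:
    """Aggregate both dimensions over a benchmark: share of well-formed and of correct solutions."""
    results = [(syntax_ok(t.generated_code), t.reference_check(t.generated_code)) for t in tasks]
    n = len(results) or 1
    return {
        "syntactic_well_formedness": sum(s for s, _ in results) / n,
        "functional_correctness": sum(c for _, c in results) / n,
    }
```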
The experimental results reveal a stark performance divide. LLMs generally achieve much better results for Python than for OCL and Alloy. A key technical bottleneck identified is the limited context window of many open-source LLMs, which often cannot hold both the constraint definition and the domain model it references. The research also quantifies the impact of mitigation strategies, finding that techniques like iterative code repair and generating multiple candidate solutions can significantly boost quality, while the choice of prompt template has less effect. This work provides developers and organizations relying on DSLs with a concrete methodology to evaluate and optimize their AI-assisted coding workflows, moving beyond trial-and-error to data-driven configuration.
- LLM code generation performance is significantly worse for constraint DSLs (OCL, Alloy) than for general-purpose Python.
- Open-source LLMs with smaller context windows often fail entirely at these tasks, because the prompt must include both the constraint definition and the domain model it references.
- Iterative code repair (asking the model to fix its own faulty output) and sampling multiple candidate solutions can markedly improve quality, while the choice of prompt template has less impact (see the sketch after this list).
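As a concrete illustration of those two levers, here is a small, hypothetical Python sketch of a best-of-n generation loop with a repair step. The `generate` and `validate` callables stand in for an LLM client and for the evaluation harness sketched earlier; none of this is the authors' actual implementation.

```python
# Hypothetical sketch of the mitigation strategies described above: sample several
# candidates and, when one fails validation, feed it back to the model for repair.
from typing import Callable, Optional


def best_of_n_with_repair(
    generate: Callable[[str], str],    # prompt -> candidate code (one LLM call); stand-in name
    validate: Callable[[str], bool],   # combined syntax + functional check
    prompt: str,
    n_candidates: int = 5,             # how many independent candidates to try
    max_repairs: int = 2,              # repair rounds allowed per candidate
) -> Optional[str]:
    """Return the first candidate that passes validation, possibly after repair rounds."""
    for _ in range(n_candidates):
        code = generate(prompt)
        for _ in range(max_repairs + 1):
            if validate(code):
                return code
            # Code repair: show the model its failing output and ask for a corrected version.
            code = generate(
                f"{prompt}\n\nThe previous attempt was rejected by the checker:\n{code}\n"
                "Please return a corrected version."
            )
    return None  # no candidate passed within the budget
```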
Why It Matters
Provides a blueprint for enterprises using niche languages to effectively integrate and benchmark AI coding assistants, avoiding costly errors.