Behaviour-Driven Development Scenario Generation with Large Language Models
Research finds Claude 3 outperforms GPT-4 and Gemini in generating high-quality Behavior-Driven Development test scenarios.
A new research paper from Monash University provides a comprehensive evaluation of large language models for automating a critical software engineering task: generating Behavior-Driven Development (BDD) scenarios. The study, led by Amila Rathnayake, Mojtaba Shahin, and Golnoush Abaei, tested GPT-4, Claude 3, and Gemini against a proprietary dataset of 500 user stories and requirements from four real software products. The findings reveal a surprising result: while GPT-4 scored higher on traditional text and semantic similarity metrics, Claude 3 consistently produced BDD scenarios that were rated highest by both human software engineering experts and LLM-based evaluators like DeepSeek.
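For context, a BDD scenario expresses expected behavior as Given/When/Then steps in Gherkin syntax. Since the study's dataset is proprietary, the user story and generated scenario below are invented stand-ins that only illustrate the input-to-output mapping the models were asked to perform:

```python
# Illustrative only: neither the user story nor the scenario comes from
# the paper's proprietary dataset.
user_story = (
    "As a registered user, I want to reset my password via email "
    "so that I can regain access to my account."
)

# A Gherkin-style BDD scenario an LLM might generate from that story.
generated_scenario = """\
Feature: Password reset

  Scenario: Registered user resets a forgotten password
    Given a registered user with a verified email address
    When the user requests a password reset
    Then a reset link is sent to the user's email
    And the link expires after a limited time
"""
print(generated_scenario)
```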
The research establishes several practical guidelines for developers. First, prompting effectiveness is highly model-specific: GPT-4 performs best with zero-shot prompts, Claude 3 benefits from chain-of-thought reasoning, and Gemini achieves optimal results with few-shot examples. Second, input quality is paramount: detailed requirement descriptions alone yield high-quality scenarios, whereas user stories alone produce low-quality output. The study also found that setting temperature to 0 and top_p to 1.0 produced the highest-quality scenarios across all models. These findings provide concrete, evidence-based recommendations for integrating LLMs into software development workflows, potentially saving significant engineering time while improving test coverage and specification accuracy.
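A minimal sketch of how these guidelines translate into code is below. The prompt wordings and the `build_prompt`/`generate` helpers are assumptions for illustration, not the paper's exact prompts; the concrete call uses the OpenAI Python SDK, and Anthropic's and Google's SDKs accept analogous temperature and top_p parameters.

```python
# Hedged sketch of the per-model prompting styles the study reports.
from openai import OpenAI

def build_prompt(model: str, requirement: str) -> str:
    if model.startswith("gpt-4"):
        # Zero-shot: the bare task, no examples or reasoning scaffold.
        return f"Write a Gherkin BDD scenario for this requirement:\n{requirement}"
    if model.startswith("claude-3"):
        # Chain-of-thought: ask the model to reason before answering.
        return (
            "Reason step by step: identify the actor, trigger, and expected "
            "outcome, then write a Gherkin BDD scenario for this "
            f"requirement:\n{requirement}"
        )
    # Gemini: few-shot, with a worked example prepended.
    return (
        "Requirement: a user can log out from any page.\n"
        "Scenario:\n"
        "  Given a logged-in user on any page\n"
        "  When the user clicks 'Log out'\n"
        "  Then the session ends and the login page is shown\n\n"
        f"Requirement: {requirement}\nScenario:\n"
    )

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(requirement: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt("gpt-4", requirement)}],
        temperature=0,  # deterministic decoding, per the study's finding
        top_p=1.0,
    )
    return resp.choices[0].message.content
```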
- Claude 3 outperformed GPT-4 and Gemini in human expert evaluations for BDD scenario generation, despite GPT-4 scoring higher on automated similarity metrics.
- Optimal prompting techniques differ by model: zero-shot for GPT-4, chain-of-thought for Claude 3, and few-shot for Gemini, with temperature=0 and top_p=1.0 working best universally.
- LLM-based evaluators like DeepSeek showed stronger correlation with human judgment than traditional text similarity metrics, validating their use for automated quality assessment.
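The LLM-as-judge setup behind that last point can be wired up roughly as follows. The rubric, prompt, and `judge_scenario` helper are assumptions for illustration, not the paper's evaluation protocol; DeepSeek's hosted API is OpenAI-compatible, which is why the same client library appears here.

```python
# Hedged sketch of LLM-as-judge scoring for a generated BDD scenario.
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint.
judge = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_KEY")

def judge_scenario(requirement: str, scenario: str) -> str:
    prompt = (
        "Rate the following BDD scenario from 1 (poor) to 5 (excellent) for "
        "completeness, correctness against the requirement, and adherence to "
        "Gherkin syntax. Reply with the score and one sentence of justification.\n\n"
        f"Requirement:\n{requirement}\n\nScenario:\n{scenario}"
    )
    resp = judge.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging for reproducible scores
    )
    return resp.choices[0].message.content
```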
Why It Matters
Provides evidence-based guidelines for developers to automate software testing, potentially saving hundreds of engineering hours on specification writing.