Behaviour-Driven Development Scenario Generation with Large Language Models
Research finds Claude 3 outperforms GPT-4 and Gemini in generating high-quality Behavior-Driven Development test scenarios.
A new research paper from Monash University provides a comprehensive evaluation of large language models for automating a critical software engineering task: generating Behavior-Driven Development (BDD) scenarios. The study, led by Amila Rathnayake, Mojtaba Shahin, and Golnoush Abaei, tested GPT-4, Claude 3, and Gemini against a proprietary dataset of 500 user stories and requirements from four real software products. The findings reveal a surprising result: while GPT-4 scored higher on traditional text and semantic similarity metrics, Claude 3 consistently produced BDD scenarios that were rated highest by both human software engineering experts and LLM-based evaluators like DeepSeek.
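For context, a BDD scenario expresses expected behavior as Given/When/Then steps in Gherkin syntax. Since the study's dataset is proprietary, the user story and generated scenario below are invented stand-ins that only illustrate the input-to-output mapping the models were asked to perform:

```python
# Illustrative only: neither the user story nor the scenario comes from
# the paper's proprietary dataset.
user_story = (
    "As a registered user, I want to reset my password via email "
    "so that I can regain access to my account."
)

# A Gherkin-style BDD scenario an LLM might generate from that story.
generated_scenario = """\
Feature: Password reset

  Scenario: Registered user resets a forgotten password
    Given a registered user with a verified email address
    When the user requests a password reset
    Then a reset link is sent to the user's email
    And the link expires after a limited time
"""
print(generated_scenario)
```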
The research establishes several practical guidelines for developers. First, prompting effectiveness is highly model-specific: GPT-4 performs best with zero-shot prompts, Claude 3 benefits from chain-of-thought reasoning, and Gemini achieves optimal results with few-shot examples. Second, input quality is paramount: detailed requirement descriptions alone yield high-quality scenarios, whereas user stories alone produce low-quality output. The study also found that setting temperature to 0 and top_p to 1.0 produced the highest-quality scenarios across all models. These findings provide concrete, evidence-based recommendations for integrating LLMs into software development workflows, potentially saving significant engineering time while improving test coverage and specification accuracy.
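A minimal sketch of how these guidelines translate into code is below. The prompt wordings and the `build_prompt`/`generate` helpers are assumptions for illustration, not the paper's exact prompts; the concrete call uses the OpenAI Python SDK, and Anthropic's and Google's SDKs accept analogous temperature and top_p parameters.

```python
# Hedged sketch of the per-model prompting styles the study reports.
from openai import OpenAI

def build_prompt(model: str, requirement: str) -> str:
    if model.startswith("gpt-4"):
        # Zero-shot: the bare task, no examples or reasoning scaffold.
        return f"Write a Gherkin BDD scenario for this requirement:\n{requirement}"
    if model.startswith("claude-3"):
        # Chain-of-thought: ask the model to reason before answering.
        return (
            "Reason step by step: identify the actor, trigger, and expected "
            "outcome, then write a Gherkin BDD scenario for this "
            f"requirement:\n{requirement}"
        )
    # Gemini: few-shot, with a worked example prepended.
    return (
        "Requirement: a user can log out from any page.\n"
        "Scenario:\n"
        "  Given a logged-in user on any page\n"
        "  When the user clicks 'Log out'\n"
        "  Then the session ends and the login page is shown\n\n"
        f"Requirement: {requirement}\nScenario:\n"
    )

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(requirement: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt("gpt-4", requirement)}],
        temperature=0,  # deterministic decoding, per the study's finding
        top_p=1.0,
    )
    return resp.choices[0].message.content
```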
- Claude 3 outperformed GPT-4 and Gemini in human expert evaluations for BDD scenario generation, despite GPT-4 scoring higher on automated similarity metrics.
- Optimal prompting techniques differ by model: zero-shot for GPT-4, chain-of-thought for Claude 3, and few-shot for Gemini, with temperature=0 and top_p=1.0 working best universally.
- LLM-based evaluators like DeepSeek showed stronger correlation with human judgment than traditional text similarity metrics, validating their use for automated quality assessment.
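The LLM-as-judge setup behind that last point can be wired up roughly as follows. The rubric, prompt, and `judge_scenario` helper are assumptions for illustration, not the paper's evaluation protocol; DeepSeek's hosted API is OpenAI-compatible, which is why the same client library appears here.

```python
# Hedged sketch of LLM-as-judge scoring for a generated BDD scenario.
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint.
judge = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_KEY")

def judge_scenario(requirement: str, scenario: str) -> str:
    prompt = (
        "Rate the following BDD scenario from 1 (poor) to 5 (excellent) for "
        "completeness, correctness against the requirement, and adherence to "
        "Gherkin syntax. Reply with the score and one sentence of justification.\n\n"
        f"Requirement:\n{requirement}\n\nScenario:\n{scenario}"
    )
    resp = judge.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging for reproducible scores
    )
    return resp.choices[0].message.content
```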
Why It Matters
Provides evidence-based guidelines for developers to automate software testing, potentially saving hundreds of engineering hours on specification writing.