Developer Tools

Prompt strategy beats model size in LLM microservice testing study

arXiv cs.SE May 15, 2026

⚡Structured prompts collapse diversity, but guided few-shot nails 8 of 14 failure modes.

Deep Dive

Researchers systematically evaluated how well LLMs generate robustness tests for microservice APIs, that is, tests that send malformed, missing, or boundary-value inputs. They applied 7 prompt strategies to 3 open-source LLMs (14B to 70B parameters) against two systems: a 6-service, Java-only application with 9 known failure modes and a 27-service polyglot application with 14 known failure modes. Across 38 valid runs that generated 663 tests, prompt strategy explained significantly more of the variation in failure-mode diversity than model size did. A Structured prompt (e.g., one requiring strict JSON format) collapsed diversity entirely, while a single model run under three different prompt strategies achieved complete failure-mode coverage on one system, beating any multi-model ensemble that used a fixed prompt.
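The comparison between one model under several prompts and several models under one prompt comes down to taking the union of failure modes triggered across runs. The sketch below illustrates that bookkeeping only. The function name `coverage`, the mode labels, and the per-run sets are made-up toy values, not data from the study.

```python
def coverage(runs):
    """Return the union of failure modes triggered by a collection of runs."""
    covered = set()
    for modes in runs:
        covered |= modes
    return covered


# Toy data (hypothetical, NOT results from the study).
all_modes = {"F1", "F2", "F3", "F4"}

# One model, three different prompt strategies: each run hits different modes.
one_model_three_prompts = [{"F1", "F2"}, {"F3"}, {"F2", "F4"}]

# Three models, one fixed prompt: the runs largely overlap.
three_models_one_prompt = [{"F1", "F2"}, {"F1", "F2"}, {"F1"}]

varied = coverage(one_model_three_prompts)
fixed = coverage(three_models_one_prompt)

print(f"varied prompts: {len(varied)}/{len(all_modes)} modes")
print(f"fixed prompt:   {len(fixed)}/{len(all_modes)} modes")
```

In this toy setup the varied-prompt runs cover every mode because their sets differ, while the fixed-prompt ensemble keeps rediscovering the same modes. Whether real runs behave this way is exactly what the study measures.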

Two novel strategies, Guided and GuidedFewShot, embedded a mutation taxonomy from prior robustness research as domain context. GuidedFewShot achieved the highest single-run coverage on both systems (5 of 9 failure modes on the Java system and 8 of 14 on the polyglot system) while keeping test similarity across models low. A key takeaway: taxonomy rules alone aren't enough, because LLMs need concrete examples to distinguish key-absent mutations from value-empty ones. The findings replicate across both systems, indicating that prompt engineering, not just model scaling, is the critical lever for effective LLM-based testing.
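The key-absent versus value-empty distinction is easier to see on a concrete request body. The following is a minimal sketch, assuming a JSON payload represented as a Python dict. The helper names `drop_key` and `empty_value`, the field names, and the sample order are hypothetical and do not come from the study or its taxonomy.

```python
import copy
import json


def drop_key(payload, key):
    """Key-absent mutation: the field is removed from the request body."""
    mutated = copy.deepcopy(payload)
    mutated.pop(key, None)
    return mutated


def empty_value(payload, key):
    """Value-empty mutation: the field is kept, but its value is an empty string."""
    mutated = copy.deepcopy(payload)
    mutated[key] = ""
    return mutated


# Hypothetical valid request body for an order endpoint.
order = {"customerId": "c-42", "quantity": 3}

key_absent = drop_key(order, "customerId")
value_empty = empty_value(order, "customerId")

print(json.dumps(key_absent))
print(json.dumps(value_empty))
```

A service may handle these two inputs through different code paths (for example, a missing-field check versus an empty-string check), which is why a test generator that conflates them can miss a failure mode.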

Key Points

Prompt strategy explains more failure-diversity variation than model size (14B vs 70B) across 38 runs with 663 tests.
The GuidedFewShot strategy achieved the highest single-run coverage: 5/9 failure modes on the Java system and 8/14 on the polyglot system.
The Structured prompt collapsed diversity entirely; a single model with varying prompts outperformed any multi-model ensemble under a fixed prompt.

Why It Matters

For engineering teams: smarter prompt design matters more than bigger models for automated microservice testing.
