Prompt strategy beats model size in LLM microservice testing study
Structured prompts collapse diversity, but guided few-shot nails 8 of 14 failure modes.
Researchers systematically evaluated how well LLMs can generate robustness tests for microservice APIs, that is, tests that send malformed, missing, or boundary-value inputs. They applied 7 prompt strategies to 3 open-source LLMs (14B to 70B parameters) against two systems: a 6-service, single-language Java app with 9 known failure modes and a 27-service polyglot app with 14 failure modes. Across 38 valid runs that generated 663 tests, the team found that prompt strategy explains significantly more of the variation in failure-mode diversity than model size does. A Structured prompt (e.g., one requiring strict JSON format) collapsed diversity entirely, while a single model run with three different prompt strategies achieved complete failure-mode coverage on one system, beating any multi-model ensemble that used a fixed prompt.
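The ensemble comparison above comes down to set arithmetic: coverage is the union of the failure modes triggered by each run. The sketch below illustrates that idea only. The failure-mode IDs (`F1` to `F9`) and the per-run results are invented for illustration and are not data from the study.

```python
# Illustrative only: failure-mode IDs and per-run results are invented,
# not data from the study.
ALL_MODES = {f"F{i}" for i in range(1, 10)}  # a system with 9 known failure modes


def coverage(runs):
    """Return the set of failure modes triggered by at least one run."""
    return set().union(*runs)


# One model run with three different prompt strategies (hypothetical results).
one_model_three_prompts = [
    {"F1", "F2", "F3", "F4"},
    {"F4", "F5", "F6"},
    {"F7", "F8", "F9"},
]

# Three models run with one fixed prompt (hypothetical results).
three_models_one_prompt = [
    {"F1", "F2", "F3"},
    {"F1", "F2", "F4"},
    {"F2", "F3", "F4"},
]

for name, runs in [
    ("prompt-varied", one_model_three_prompts),
    ("model-varied", three_models_one_prompt),
]:
    covered = coverage(runs)
    print(f"{name}: {len(covered)}/{len(ALL_MODES)} failure modes")
```

In this made-up case the runs that vary the prompt trigger different failure modes, so their union is large, while the runs that vary only the model mostly overlap.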
Two novel strategies, Guided and GuidedFewShot, embedded a mutation taxonomy from prior robustness research as domain context. GuidedFewShot achieved the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low test similarity across models. A key takeaway: taxonomy rules alone aren't enough. LLMs need concrete examples to distinguish key-absent mutations (the field is omitted) from value-empty mutations (the field is present but empty). The findings replicate across both systems, which supports the conclusion that prompt engineering, not just model scaling, is the critical lever for effective LLM-based testing.
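To make the key-absent versus value-empty distinction concrete, here is a minimal sketch of the two mutations applied to a JSON request body. The payload, its field names, and the helper functions `drop_key` and `empty_value` are hypothetical and are not taken from the study's taxonomy or from either system under test.

```python
import copy
import json


def drop_key(payload, key):
    """Key-absent mutation: remove the field from the request body."""
    mutated = copy.deepcopy(payload)
    mutated.pop(key, None)
    return mutated


def empty_value(payload, key):
    """Value-empty mutation: keep the field but set it to an empty string."""
    mutated = copy.deepcopy(payload)
    mutated[key] = ""
    return mutated


# Hypothetical request body, not taken from either system in the study.
order = {"userId": "u-123", "itemId": "sku-9", "quantity": 2}

key_absent = drop_key(order, "itemId")
value_empty = empty_value(order, "itemId")

print(json.dumps(key_absent))   # the "itemId" field is missing entirely
print(json.dumps(value_empty))  # the "itemId" field is present but empty
```

A service may treat these two inputs differently, for example by rejecting the missing field during validation while passing the empty string through to downstream code, which is why a test generator that conflates them can miss a failure mode.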
- Prompt strategy explains more of the variation in failure-mode diversity than model size (14B to 70B) across 38 runs with 663 tests.
- The GuidedFewShot strategy achieved the highest single-run coverage: 5/9 failure modes on the Java system and 8/14 on the polyglot system.
- The Structured prompt collapsed diversity entirely; a single model with varying prompts outperformed any multi-model ensemble under a fixed prompt.
Why It Matters
For engineering teams: smarter prompt design matters more than bigger models for automated microservice testing.