AI agents get claim-driven skills for distributed systems testing
Claude Code, Codex, Copilot CLI, and others can now produce structured test plans and findings reports with 9-state verdicts.
This workflow introduces two complementary skills for AI coding agents: one for designing a test plan and one for executing it. The plan skill begins with the system's claims (e.g., durability, linearizability), generates hypotheses tied to those claims, and writes scenarios named after the claim each tries to falsify. Each consistency-critical scenario binds an abstract model (register, queue, log, etc.) to an operation-history schema, a named checker, and a nemesis with observable evidence that the fault actually landed. The plan ends with a coverage adequacy argument and a conservative confidence statement.
The execute skill discovers existing tests, runbooks, and fault-injection scaffolding before inventing new ones, then runs each scenario and records a verdict for it. The output is a structured Markdown report that draws its verdicts from a 9-state set (including PASS, FAIL, and AMBIGUOUS) and attaches blame tags (SUT, harness, checker, environment) so failures are routed to the correct team.
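The scenario structure described above can be sketched as a small record type. This is only an illustration: the class, field names, and naming pattern below are hypothetical and are not taken from the SKILL.md files, which are Markdown instructions rather than code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """One scenario: one claim under one fault (all field names are hypothetical)."""
    claim: str                            # e.g. "linearizability"
    fault: str                            # the single fault the nemesis injects
    consistency_critical: bool
    model: Optional[str] = None           # abstract model: "register", "queue", "log"
    history_schema: Optional[str] = None  # shape of the recorded operation history
    checker: Optional[str] = None         # named checker that judges the history

    @property
    def name(self) -> str:
        # Scenarios are named after the claim they try to falsify
        # (the exact naming pattern here is an assumption).
        return f"falsify-{self.claim}-under-{self.fault}"


def missing_bindings(s: Scenario) -> list[str]:
    """Return the bindings a consistency-critical scenario still lacks."""
    if not s.consistency_critical:
        return []
    required = {
        "model": s.model,
        "history_schema": s.history_schema,
        "checker": s.checker,
    }
    return [field for field, value in required.items() if not value]


draft = Scenario(
    claim="linearizability",
    fault="network-partition",
    consistency_critical=True,
    model="register",
)
print(draft.name)               # falsify-linearizability-under-network-partition
print(missing_bindings(draft))  # ['history_schema', 'checker']
```

A plan reviewer could reject any scenario for which `missing_bindings` is non-empty before execution starts.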
Why this matters: Traditional integration tests catch only a fraction of the bugs that distributed systems exhibit under network partitions, crash recovery, and replayed operations that are assumed to be idempotent. This approach enforces an opinionated workflow: claim-driven rather than test-driven, coverage adequacy as a deliverable, and reuse of existing tooling. By requiring every scenario to falsify one claim under one fault and to cite oracle execution evidence for a PASS, it is designed to eliminate silent passes, where "the chaos script ran cleanly" is mistaken for "the claim survived." The final plan includes a residual uncertainty section and a confidence statement, so a reviewer can make a ship decision without re-running the tests. The skills are plain SKILL.md files, compatible with Claude Code, Codex, Copilot CLI, Cursor, Gemini, or any agent that reads Markdown and can run shell commands.
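The rule that a PASS must cite oracle execution evidence can be illustrated with a small grading function. This is a minimal sketch under stated assumptions: the `Evidence` fields and the `grade` function are hypothetical, only the three verdicts named in the text are modeled (the other six states are not listed here), and falling back to AMBIGUOUS when evidence is missing is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Only the three states named in the text; the full set has nine.
    PASS = "PASS"
    FAIL = "FAIL"
    AMBIGUOUS = "AMBIGUOUS"


@dataclass(frozen=True)
class Evidence:
    """What a scenario run left behind (hypothetical fields)."""
    fault_landed: bool      # observable proof that the nemesis took effect
    oracle_ran: bool        # the checker actually executed on the recorded history
    oracle_violation: bool  # the checker reported a violation of the claim


def grade(e: Evidence) -> Verdict:
    """Never award PASS just because the run finished without errors."""
    if e.oracle_ran and e.oracle_violation:
        return Verdict.FAIL
    if e.oracle_ran and e.fault_landed:
        return Verdict.PASS
    # Assumption: with no oracle run or no landed fault, the claim was not tested.
    return Verdict.AMBIGUOUS


# "The chaos script ran cleanly," but no checker ran on the history.
print(grade(Evidence(fault_landed=True, oracle_ran=False, oracle_violation=False)))
# Verdict.AMBIGUOUS
```

The point of the gate is that a clean exit code alone can never produce a PASS; only a checker run against a history recorded under a landed fault can.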
- Two SKILL.md files: one designs a claim-driven test plan, the other executes scenarios and produces a findings report.
- Every scenario falsifies one product claim under one fault; consistency-critical scenarios require model + history + checker.
- 9-state verdicts with SUT/harness/checker/environment blame classification, so each failure is routed to the team that owns it.
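Blame-based routing might look like the sketch below. The four blame tags come from the report format described above; the owner names, the `route` function, and the scenario names are invented for illustration.

```python
from collections import defaultdict
from enum import Enum


class Blame(Enum):
    SUT = "SUT"                  # the system under test violated its claim
    HARNESS = "harness"          # the test driver or fault injector misbehaved
    CHECKER = "checker"          # the oracle itself is suspect
    ENVIRONMENT = "environment"  # machines, network, or CI were at fault


# Hypothetical owners; a real report would map tags to actual teams.
OWNERS = {
    Blame.SUT: "product-team",
    Blame.HARNESS: "test-infra",
    Blame.CHECKER: "verification",
    Blame.ENVIRONMENT: "platform-ops",
}


def route(findings: list[tuple[str, Blame]]) -> dict[str, list[str]]:
    """Group failed scenario names by the team that owns the blamed component."""
    queues: dict[str, list[str]] = defaultdict(list)
    for scenario_name, blame in findings:
        queues[OWNERS[blame]].append(scenario_name)
    return dict(queues)


findings = [
    ("falsify-durability-under-crash", Blame.SUT),
    ("falsify-linearizability-under-partition", Blame.HARNESS),
    ("falsify-durability-under-disk-full", Blame.SUT),
]
print(route(findings))
```

Keeping the blame tag separate from the verdict means a harness bug does not get filed against the product team.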
Why It Matters
Enables rigorous, reproducible distributed systems testing with AI agents, replacing ad-hoc integration tests with claim-driven verification.