Agentic Kubernetes ops get falsifiable testing with new agent-breakage framework

arXiv cs.SE May 25, 2026

⚡ A new open-source framework caught a +19% selection-bias artifact and small-sample effect estimates overstated by roughly 3x.

Deep Dive

The paper tackles a critical problem in autonomous Kubernetes operations: empirical claims about agent performance are largely unfalsifiable because they lack controlled baselines, suffer from selection bias, and rest on small samples. The authors introduce agent-breakage, a measurement substrate that injects faults into a target K8s cluster, observes the agent's responses, and scores them on four axes against ground truth. It distinguishes framework error from reasoning error, enforces pre-registered decision matrices, and supports a true off-condition control via a deterministic embedder. The aim is to give operations agents something like the verification substrate that code agents already have.
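One way to picture a deterministic embedder used as an off-condition control is an embedder whose output depends only on a hash of the input text, so the retrieval pipeline runs end to end while the vectors carry no learned semantic signal. The sketch below is an illustration under that assumption, not the framework's implementation: the function name `deterministic_embed`, the `dim` parameter, and the hashing scheme are all hypothetical.

```python
import hashlib
import math
import struct


def deterministic_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical control embedder: a unit vector derived only from a hash.

    The same text always yields the same vector, but vector similarity
    says nothing about meaning, so retrieval is effectively switched off.
    """
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Each SHA-256 digest is 32 bytes, i.e. eight unsigned 32-bit words.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        for (word,) in struct.iter_unpack(">I", digest):
            # Map the word from [0, 2^32 - 1] to [-1, 1].
            values.append(word / 0xFFFFFFFF * 2.0 - 1.0)
            if len(values) == dim:
                break
        counter += 1
    # Normalize to unit length so cosine and dot-product search agree.
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]
```

Swapping an embedder like this into the retrieval path keeps every other component identical between the on and off conditions, which is what makes the comparison a control rather than a different system.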

Using this framework, the researchers tested whether retrieval over past postmortems compounds an agent's capability. The methodology caught three confounds that would otherwise have led to wrong published claims: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself was a partial falsification: only 1 of 3 dense-corpus scenarios was significant at p<0.05, and the pooled effect of +3.9 percentage points was not significant at n=60. A within-scenario sweep of 360 runs showed that mechanistic alignment of near-neighbors dominates raw count. The framework is released as open source at version 0.1.0 under Apache 2.0.
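To see why a +3.9 percentage point effect can fail to reach significance at n=60, a rough two-proportion z-test helps. The sketch below is illustrative only: the summary does not state the baseline success rate, whether n=60 is per arm or in total, or which test the authors used, so the 50% baseline and the per-arm reading are hypothetical assumptions.

```python
import math


def two_proportion_z(p_control: float, delta: float, n_per_arm: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for a difference `delta` between two
    equally sized arms, using the pooled-variance normal approximation."""
    p_treat = p_control + delta
    pooled = (p_control + p_treat) / 2.0  # equal arm sizes
    se = math.sqrt(pooled * (1.0 - pooled) * (2.0 / n_per_arm))
    z = delta / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical inputs: 50% baseline, +3.9 pp effect, 60 runs per arm.
z, p = two_proportion_z(p_control=0.50, delta=0.039, n_per_arm=60)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumed inputs the standard error is several times larger than the effect, so the test is nowhere near the 0.05 threshold. This is the same mechanism that lets small samples produce inflated point estimates when only the runs that happen to look significant get reported.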

Key Points

Three confounds caught by the substrate: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x.
Retrieval compounding result: only 1 of 3 scenarios significant (p<0.05), pooled effect +3.9 pp, not significant at n=60.
Framework uses deterministic-embedder control and pre-registered decision matrices to enforce falsifiability; 360-run sweep showed near-neighbor alignment beats raw count.

Why It Matters

The framework brings falsifiability to agentic Kubernetes ops, helping catch misleading performance claims before they are published and supporting more reliable autonomous infrastructure.
