Research & Papers

Loopzero benchmark: current recursive-collapse detectors all fail under strict false-positive control

arXiv cs.SY June 02, 2026

⚡ Two canonical benchmarks, equal alert budgets: zero detectors pass the bar.

Deep Dive

David Mullett's new paper, "Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control," introduces Loopzero, a claim-bounded benchmark framework for testing whether recursive systems show a directional telemetry pattern before they collapse: rising gain (G), recursive persistence (p), and declining diversity (δ). The framework is specified in Lean and evaluated on two frozen public-artifact benchmarks: a segmented public-markets benchmark (covering Volmageddon 2018 and the COVID MWCB 2020) and an offline deterministic replay of MovieLens-25M.
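As a rough illustration of what a "directional telemetry pattern" check could look like, the sketch below fits a least-squares slope to each of the three series and tests the signs. This is not the paper's Lean specification: the function names, the use of a linear slope, and the reading of "recursive persistence" as a rising quantity are all assumptions made for the example.

```python
from statistics import mean


def slope(series):
    """Ordinary least-squares slope of a series against its index 0..n-1."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(series))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def directional_pattern(gain, persistence, diversity):
    """True when gain trends up, persistence trends up, diversity trends down.

    Assumption: persistence is treated as rising; the summary names the
    quantity but does not state how the paper measures its direction.
    """
    return slope(gain) > 0 and slope(persistence) > 0 and slope(diversity) < 0


# Hypothetical telemetry window (made-up numbers, for illustration only).
G = [1.0, 1.2, 1.5, 1.9]
p = [0.10, 0.15, 0.22, 0.30]
delta = [5.0, 4.4, 3.1, 2.0]
print(directional_pattern(G, p, delta))
```

A sign check like this only says whether the three series move in the expected directions; it says nothing about whether a detector built on them can operate within a fixed alert budget.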

The central design choice is a locked, pre-registered equal-false-positive contract (FP ∈ [0.03, 0.07]) that holds every detector to the same alert budget. Under this constraint, neither the standard comparators nor Loopzero's own pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, though the paper discloses adjacent-horizon and row-level limitations. The paper also notes that digitized trajectories from Shumailov et al. (2024) LLM training-loop simulations are directionally consistent with the pattern, but matched-FP evaluation in that domain is deferred. The contribution is a reproducible, falsifiable benchmark that reports non-acceptance as a first-class scientific outcome, something seldom stated so directly in AI safety work.
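To make the matched-false-positive idea concrete, here is a minimal sketch of calibrating a detector's alert threshold so that its false-positive rate lands inside the band [0.03, 0.07], and reporting "no operating point" when no threshold does. The function names, the score-and-label layout, and the rule of picking the highest in-band threshold are assumptions for illustration; the paper's full acceptance criteria are not given in this summary.

```python
def false_positive_rate(scores, labels, threshold):
    """Share of negative windows (label 0) whose score meets the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("no negative windows to measure false positives on")
    return sum(s >= threshold for s in negatives) / len(negatives)


def recall(scores, labels, threshold):
    """Share of positive windows (label 1) whose score meets the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("no positive windows to measure recall on")
    return sum(s >= threshold for s in positives) / len(positives)


def calibrate(scores, labels, fp_low=0.03, fp_high=0.07):
    """Highest threshold whose FP rate lies in [fp_low, fp_high], else None.

    Candidate thresholds are the observed scores, scanned from high to low,
    so the FP rate can only grow as the scan proceeds.
    """
    for t in sorted(set(scores), reverse=True):
        fp = false_positive_rate(scores, labels, t)
        if fp_low <= fp <= fp_high:
            return t
        if fp > fp_high:
            break  # lower thresholds can only raise the FP rate further
    return None


# Hypothetical detector output: 100 quiet windows and 2 pre-collapse windows.
scores = [i / 100 for i in range(100)] + [0.985, 0.5]
labels = [0] * 100 + [1, 1]
t = calibrate(scores, labels)
if t is None:
    print("no operating point inside the FP band")
else:
    print(t, false_positive_rate(scores, labels, t), recall(scores, labels, t))
```

Returning `None` rather than the nearest out-of-band threshold mirrors the idea of treating non-acceptance as a reportable result: a detector whose scores are too coarse or too tied to hit the band simply has no comparable operating point.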

Key Points

The Loopzero framework tests for rising gain, recursive persistence, and declining diversity; none of the tested detectors passed under a strict false-positive budget (FP 3–7%).
Evaluated on Volmageddon 2018, COVID MWCB 2020, and a MovieLens-25M replay, all standard comparators failed to reach an accepted operating point.
LLM training-loop trajectories from Shumailov et al. (2024) are directionally consistent, but matched-FP evaluation in that domain is deferred.

Why It Matters

If collapse-warning detectors cannot reach an accepted operating point even on controlled benchmarks, systems that rely on them may get no reliable warning before a failure.
