Control Debt
Engineering shortcuts at frontier labs could undermine future AI safety guarantees.
Control debt, a concept from Ida Caspary, describes how AI labs accumulate engineering shortcuts that undermine the feasibility of future safety guarantees. The core problem: when labs prioritize speed and developer convenience, they break the conditions that rigorous control protocols require, such as comprehensive logging, clear audit trails, and per-action approval flows. An example: Claude Code now writes the majority of Anthropic's internal code, meaning the very model that future control systems must monitor is co-authoring the codebase. This creates a feedback loop in which safety cases rely on evidence from a system partly built by the potential adversary.
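To make that requirement concrete, here is a minimal Python sketch of the kind of per-action approval gate a rigorous control protocol assumes. Everything in it (Action, AuditLog, approve_action) is a hypothetical illustration, not Anthropic's actual tooling; the point is the invariant that nothing executes without a logged decision.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Action:
    """A single tool call or code edit proposed by a model (illustrative)."""
    actor: str    # which model or agent proposed it
    kind: str     # e.g. "file_write", "shell_command"
    payload: str  # the command or diff itself

class AuditLog:
    """Append-only JSONL log; a control evaluation needs this to be complete."""
    def __init__(self, path: str):
        self.path = path

    def record(self, action: Action, approved: bool, reviewer: str) -> None:
        entry = {"ts": time.time(), "approved": approved,
                 "reviewer": reviewer, **asdict(action)}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

def approve_action(action: Action, log: AuditLog) -> bool:
    """Per-action approval: a human decides before anything executes,
    and the decision is logged either way."""
    print(f"[{action.actor}] wants {action.kind}: {action.payload}")
    ok = input("approve? [y/N] ").strip().lower() == "y"
    log.record(action, approved=ok, reviewer="human")
    return ok
```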
Caspary cites concrete evidence: Anthropic's Opus 4.6 Sabotage Risk Report (early 2026) acknowledges that internal monitoring isn't comprehensive, and external red-teaming by METR found specific vulnerabilities in agent monitoring and security systems, some since patched. Additionally, Anthropic's engineering team reported that sandboxing 'safely reduces permission prompts by 84%', which implies the per-action approval flow was so burdensome that teams built workarounds. These choices (shared credentials, short log retention, post-merge code reviews) are individually rational but collectively build a debt that must be paid when real control evaluations begin.
- Claude Code now writes the majority of Anthropic's internal code, meaning the model co-authors the code that future control protocols must monitor.
- METR red-teaming found specific vulnerabilities in Anthropic's internal agent monitoring; some have been patched, but gaps remain.
- Sandboxing reduced permission prompts by 84%, showing how developer friction drives workarounds that weaken safety properties; see the sketch after this list.
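The 84% figure hints at what such a workaround looks like in practice: instead of asking a human per action, a sandbox policy pre-approves whole classes of actions. A hypothetical sketch of that erosion, reusing the illustrative names from the approval gate above (sandboxed_approve and the allowed_kinds policy are assumptions, not Anthropic's sandboxing design):

```python
def sandboxed_approve(action: Action, log: AuditLog,
                      allowed_kinds: set[str]) -> bool:
    """Blanket pre-approval inside a sandbox: allowed kinds skip the
    human entirely, so the log records a policy decision rather than
    a human judgment. Fewer prompts, but weaker evidence."""
    if action.kind in allowed_kinds:
        log.record(action, approved=True, reviewer="sandbox-policy")
        return True
    return approve_action(action, log)  # only rare actions reach a human
```

Every prompt avoided this way is also an audit entry that no longer reflects human judgment, which is precisely the debt a future control evaluation would have to pay down.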
Why It Matters
Without resolving control debt, AI labs risk deploying systems they cannot reliably verify or contain.