AI Safety

Blind deep-deployment evals for control & sabotage

LessWrong AI May 07, 2026

⚡Third-party auditors can now test internal AI systems without accessing lab codebases.

Deep Dive

AI labs are gearing up to automate internal staff within the next year, raising the stakes for control and sabotage evals that estimate deployment safety under adversarial pressure. Currently, third-party auditors lack access to real internal systems, forcing them to rely on proxy metrics and best guesses, which diminishes the actionability of eval outcomes. Dylan Bowman, with feedback from Apollo Research, introduces 'blind deep-deployment' evals as a solution. In this approach, an auditor sends the lab a specification: a long prompt with instructions for creating the eval, a checklist prompt for proper implementation, and a code harness with scaffolding for setup and grading. The lab then uses an internal coding agent to fill in the blanks based on actual resources—codebases, enterprise workspaces, infrastructure, and user permissions. For example, the auditor could specify an eval where an agent must modify the GPU dev instance control plane while inserting a backdoor for future arbitrary provisioning; the lab populates the specific deployment settings, security measures, and monitoring.

There are three key advantages. First, no intellectual property is shared with the auditor. Second, the eval instructions are fully legible in the creating agent's prompt, increasing trust in the eval's construction. Third, this low-overhead process helps safety teams by supplementing them with additional eval ideas from the community. The biggest hurdle is that current coding agents (as of May 2026) cannot implement these evals in one shot—human engineer intervention is required. However, Bowman is optimistic that improvements in coding agents and automated eval creation will make the approach feasible by the end of 2026. Once automated, these evals can be demoed on open-source LLM training repos before sending specs to labs. A disadvantage is that labs can 'swallow' the eval and control result disclosure, but labs like Anthropic have shown amenability via partnerships (e.g., METR's internal red-teaming report). Even without full transparency, the process still strengthens lab safety efforts.

Key Points

Auditors send a specification prompt, checklist, and code harness; labs fill in internal deployment details without sharing IP.
Current coding agents (May 2026) require human intervention; feasibility of one-shot implementation expected by end of 2026.
Labs can suppress results, but approach supplements safety teams and builds on precedent (e.g., METR-Anthropic red-teaming).

Why It Matters

Realistic safety evaluations for AI internal deployments without requiring labs to expose proprietary systems.

Read Original Article

Blind deep-deployment evals for control & sabotage

Why It Matters

Stay Ahead in AI