New Paper Proposes Backchaining Method for AI Loss of Control in National Security
Deployers can use mission-specific benchmarks to identify and block AI failure paths today.
A new research paper from Matteo Pistillo, Samantha Faraone, and Joshua Herman (arXiv:2605.21095) proposes a practical methodology for mitigating Loss of Control (LoC) threats in high-stakes national security deployments of AI systems. The authors argue that structured threat modeling, pre-deployment evaluations, and continuous monitoring are valuable but often lack immediate, empirical grounding. In response, the team introduces a three-step backchaining process that uses the AI system's own errors on mission-specific benchmarks to identify which affordances and permissions (system capabilities that could be misused) should be restricted.
The process works as follows. First, deployers evaluate an AI system on benchmarks that closely approximate real-world use cases in defense or intelligence. Second, rather than only scoring accuracy, they examine the incorrect responses, ask what actions the AI might pursue if it acted on those wrong answers, and backchain from the resulting harms to the affordances (e.g., the ability to access a database) and permissions (e.g., write access) that would enable them. Third, they selectively restrict those specific affordances and permissions, bottlenecking harm pathways while preserving the AI's ability to execute correct actions. The paper demonstrates the method with a derivative security classification example, showing how an error on a classification question could lead to harmful data access and how targeted permission revocation can prevent it without breaking legitimate use.
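As a rough illustration of the three steps, the Python sketch below is not taken from the paper. The `EvalRecord` type, the `backchain` function, the permission strings, and the rule of sending overlapping permissions to human review are all hypothetical assumptions made for this example. It collects the permissions that harmful wrong-answer actions would need, then removes any that correct actions also rely on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One benchmark item (hypothetical schema, not from the paper)."""
    item_id: str
    is_correct: bool
    downstream_action: str           # what the system might do if it acted on its answer
    required_permissions: frozenset  # permissions that action would need
    harmful: bool = False            # analyst judgment about the downstream action


def backchain(records):
    """Return (permissions to restrict, permissions needing human review)."""
    needed_for_correct = set()
    enabling_harm = set()
    for r in records:
        if r.is_correct:
            needed_for_correct |= r.required_permissions
        elif r.harmful:
            enabling_harm |= r.required_permissions
    # Assumption: restrict only what correct actions never use;
    # anything shared is flagged for review rather than revoked.
    restrict = enabling_harm - needed_for_correct
    needs_review = enabling_harm & needed_for_correct
    return restrict, needs_review


# Made-up records loosely modeled on a derivative classification task.
records = [
    EvalRecord("q1", True, "apply derived marking to document",
               frozenset({"read:source_doc", "write:label"})),
    EvalRecord("q2", False, "copy document to lower-classification store",
               frozenset({"read:source_doc", "write:low_store"}), harmful=True),
]

restrict, needs_review = backchain(records)
print("restrict:", sorted(restrict))
print("review:", sorted(needs_review))
```

In this toy case only the write permission to the lower-classification store is revoked, while the shared read permission is left in place and surfaced for review, mirroring the goal of blocking a harm pathway without breaking correct behavior.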
- Methodology uses three steps: evaluate on mission benchmarks, backchain errors to harmful affordances, then selectively restrict permissions.
- Focuses on practical, evidence-based mitigation using benchmarks deployers already have, not theoretical threat models.
- Illustrated with a derivative security classification example, showing how a wrong classification could enable unauthorized data access.
Why It Matters
Gives defense and intelligence deployers a concrete, empirical method to reduce AI loss-of-control risks starting today.