Research & Papers

Human-Guided Harm Recovery for Computer Use Agents

A new framework teaches AI agents to clean up their own messes after causing harm on a computer.

Deep Dive

A team of researchers including Christy Li and Sky CH-Wang has published a foundational paper addressing a critical gap in AI safety for computer-use agents. As AI agents gain the ability to execute actions on real systems, the focus has been on preventing harm, but not on fixing it after it occurs. The paper formalizes this challenge as 'harm recovery'—the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences.
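
The paper's exact formal definition isn't reproduced in this summary, but one natural reading is a constrained, preference-weighted planning problem. In the sketch below all notation is ours: s_h is the harmful state the agent has reached, T(s_h) the set of recovery trajectories available from it, S_safe the set of acceptable end states, and R_human a reward function aligned with human preferences.

    \tau^{*} = \arg\max_{\tau \in \mathcal{T}(s_h)} R_{\mathrm{human}}(\tau)
    \quad \text{s.t.} \quad s_T(\tau) \in \mathcal{S}_{\mathrm{safe}}

In words: among all ways of acting from the harmful state, prefer the trajectory humans would rate highest, subject to actually ending in a safe state.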

To ground this concept, the researchers ran a formative user study to identify what humans value in a recovery process, distilling the findings into a natural-language rubric. They then collected a dataset of 1,150 pairwise human judgments, which revealed that preferences are context-dependent; for instance, people often favor pragmatic, targeted fixes over comprehensive long-term solutions. These insights were operationalized into a reward model that, at test time, re-ranks multiple candidate recovery plans generated by an agent scaffold.
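
The scaffold itself isn't detailed in this summary, but the test-time re-ranking step is a standard best-of-N pattern. The sketch below illustrates that pattern under assumed interfaces; names like propose_plan and RewardModel are hypothetical stand-ins, not the authors' actual API.

    # Best-of-N re-ranking of candidate recovery plans (illustrative sketch;
    # propose_plan and RewardModel are assumed names, not the paper's API).
    class RewardModel:
        """Scores a recovery plan against learned human recovery preferences."""
        def score(self, harmful_state: str, plan: list[str]) -> float:
            ...  # placeholder: e.g., a preference model trained on pairwise judgments

    def best_recovery_plan(agent, reward_model, harmful_state, n=8):
        # Sample several candidate recovery plans from the agent scaffold...
        candidates = [agent.propose_plan(harmful_state) for _ in range(n)]
        # ...then keep the one the human-preference reward model ranks highest.
        return max(candidates, key=lambda plan: reward_model.score(harmful_state, plan))

One appeal of this design is that it leaves the base agent untouched: the reward model is applied purely at selection time, so in principle it can sit on top of any scaffold.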

To evaluate recovery capabilities systematically, the team introduced BackBench, a new benchmark of 50 diverse computer-use tasks that test an agent's ability to recover from intentionally harmful states. In human evaluations, their reward-model scaffold produced significantly higher-quality recovery trajectories than both base AI agents and simpler rubric-based scaffolds.
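
BackBench's task format isn't specified here beyond the 50 tasks with intentionally harmful starting states, so the harness below is only a schematic of how such a benchmark is typically driven; every name in it is hypothetical.

    # Schematic recovery-benchmark loop (all names hypothetical).
    def evaluate(agent, tasks, judge):
        scores = []
        for task in tasks:                      # e.g., the 50 BackBench tasks
            env = task.make_env()               # spin up a sandboxed computer
            env.apply(task.harmful_actions)     # reproduce the intended harmful state
            trajectory = agent.recover(env, task.description)
            scores.append(judge.rate(task, trajectory))  # human (or proxy) quality rating
            env.teardown()
        return sum(scores) / len(scores)        # mean recovery quality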

Together, these contributions—the formalization of harm recovery, the human-preference dataset and reward model, and the BackBench evaluation suite—lay the groundwork for a new class of post-execution safety methods. This shifts the paradigm from pure harm prevention to enabling agents to competently navigate and rectify the aftermath of their mistakes, a crucial capability for deploying autonomous agents in real-world environments.

Key Points
  • Formalizes 'harm recovery' for AI agents that act on computers, moving beyond just prevention to post-harm remediation.
  • Built a reward model from 1,150 human judgments, showing a preference for pragmatic fixes over comprehensive overhauls.
  • Introduced the BackBench benchmark with 50 tasks; in human evaluations, their method outperformed both base agents and rubric-based scaffolds on recovery quality.

Why It Matters

Enables safer deployment of autonomous AI agents by giving them the ability to fix their own harmful actions on user systems, rather than relying on harm prevention alone.