Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents
New 'Traversal-as-Policy' method boosts agent success from 34.6% to 73.6% while cutting token usage by 40%.
A research team has introduced 'Traversal-as-Policy,' a novel framework that addresses the core reliability and safety issues plaguing current autonomous AI agents. Instead of relying on the implicit, unverifiable policies embedded within large language model (LLM) weights and chat transcripts, the method distills successful and failed execution logs from a sandboxed environment into a single, executable Gated Behavior Tree (GBT). Each node in this tree represents a verified, state-conditioned action 'macro' mined from past successes, while safety gates derived from unsafe traces are attached to block those failure modes from recurring. This externalizes the agent's decision-making logic into a transparent, inspectable structure.
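To make the structure concrete, here is a minimal Python sketch of what such a tree might look like. All names (`Macro`, `Gate`, `GBTNode`) and fields are illustrative assumptions; the paper's actual data model is not detailed in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List

State = dict  # simplified stand-in for an agent's observable environment state


@dataclass
class Macro:
    """A verified, state-conditioned action sequence mined from successful logs."""
    name: str
    precondition: Callable[[State], bool]  # when this macro is applicable
    actions: List[str]                     # concrete steps to replay in the sandbox


@dataclass
class Gate:
    """A safety check derived from an unsafe trace; blocks a known failure mode."""
    reason: str
    is_safe: Callable[[State], bool]


@dataclass
class GBTNode:
    macro: Macro
    gates: List[Gate] = field(default_factory=list)
    children: List["GBTNode"] = field(default_factory=list)

    def admissible(self, state: State) -> bool:
        # A node may fire only if its macro's precondition holds
        # and every attached safety gate passes.
        return self.macro.precondition(state) and all(g.is_safe(state) for g in self.gates)
```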
The key innovation is treating traversal of this pre-verified tree, rather than unconstrained LLM generation, as the primary control policy. At runtime, a lightweight component matches the LLM's intent to a safe child node, executes one macro at a time under strict gating, and falls back to a risk-aware recovery routine when traversal stalls. This replaces the typical, costly, and error-prone practice of replaying entire conversation histories. The results are dramatic: on the challenging SWE-bench Verified software-engineering benchmark, the GBT method (GBT-SE) more than doubled the success rate from 34.6% to 73.6%, nearly eliminated safety violations (0.2% vs. 2.8%), and reduced token and character usage by approximately 40%.
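The control loop might then look like the sketch below, which reuses `GBTNode` from the previous snippet. The callables `propose_intent`, `match`, `recover`, `apply_action`, and `goal_reached` are hypothetical stand-ins for components the summary names but does not specify.

```python
def traverse(root: GBTNode, state: State, propose_intent, match, recover,
             apply_action, goal_reached, max_steps: int = 50) -> State:
    """Execute one gated macro per step; fall back to recovery when no safe child matches."""
    node = root
    for _ in range(max_steps):
        if goal_reached(state):
            return state
        intent = propose_intent(state)           # lightweight LLM call, not a full-history replay
        safe = [c for c in node.children if c.admissible(state)]
        child = match(intent, safe)              # e.g. embedding or keyword similarity
        if child is None:
            state, node = recover(state, node)   # risk-aware recovery when traversal stalls
            continue
        for action in child.macro.actions:       # exactly one macro per step, under gating
            state = apply_action(state, action)
        node = child
    return state
```

Because the LLM only proposes a short intent and the tree constrains which actions are admissible, per-step prompts stay small, which is consistent with the reported ~40% reduction in token usage.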
Furthermore, the distilled GBT acts as a powerful, transferable policy. When the same verified tree was used with a much smaller 8B-parameter model as the executor, success rates roughly quadrupled on both SWE-bench Verified (14.0% to 58.8%) and the WebArena benchmark (9.1% to 37.3%). This demonstrates that the framework's robustness comes from the verifiable tree structure itself, not just raw model scale, enabling efficient and safe deployment.
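A toy end-to-end run illustrates that transfer claim: the same tree is traversed by two interchangeable 'executors' (plain functions standing in for LLMs of different sizes). Everything below is illustrative, reuses the sketches above, and involves no real model or benchmark.

```python
def keyword_match(intent, candidates):
    # Hypothetical matcher: pick the first child whose macro name overlaps the intent.
    for c in candidates:
        if any(tok in intent for tok in c.macro.name.split("_")):
            return c
    return None

def sandbox_step(state, action):
    # Stand-in for a sandboxed environment transition.
    new = dict(state)
    new["patched" if action == "patch" else "tested"] = True
    return new

# Distilled once: apply a patch (gated against double-patching), then run the tests.
patch_node = GBTNode(
    Macro("apply_patch", lambda s: not s["patched"], ["patch"]),
    gates=[Gate("never patch twice", lambda s: not s["patched"])],
)
patch_node.children.append(GBTNode(Macro("run_tests", lambda s: s["patched"], ["test"])))
tree = GBTNode(Macro("start", lambda s: True, []), children=[patch_node])

for executor in (
    lambda s: "apply the patch" if not s["patched"] else "run the tests",  # "large" model
    lambda s: "patch it, then run tests",                                  # "small" model
):
    final = traverse(tree, {"patched": False, "tested": False},
                     propose_intent=executor, match=keyword_match,
                     recover=lambda s, n: (s, tree),  # trivial recovery: restart at root
                     apply_action=sandbox_step, goal_reached=lambda s: s["tested"])
    print(final)  # both executors reach {'patched': True, 'tested': True}
```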
- More than doubled the success rate on SWE-bench Verified, from 34.6% to 73.6%, while slashing safety violations from 2.8% to 0.2%.
- Reduced computational cost by ~40%, cutting token usage from 208k to 126k per task.
- Enabled a much smaller 8B-parameter model to roughly quadruple its success rate, from 14.0% to 58.8%, on SWE-bench Verified.
Why It Matters
This makes autonomous AI agents fundamentally safer, more reliable, and cheaper to run, moving control from opaque models to verifiable, auditable systems.