Semantic Early-Stopping Cuts Token Use by 38% in LLM Agent Loops Without Quality Loss
Stop wasting tokens on easy problems: cut costs while keeping quality.
Sahil Shrivastava's paper 'Semantic Early-Stopping for Iterative LLM Agent Loops' tackles a common but wasteful practice: running multi-agent LLM loops (like a Writer drafting and a Critic revising) with a fixed iteration cap. This syntactic kill-switch overspends tokens on simple inputs and cuts difficult ones short. The proposed solution halts the loop based on meaning: it stops once the embeddings of consecutive drafts stop changing (measured by cosine distance, with a patience window) and, in a quality-gated variant, once answer quality stops improving.
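The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the threshold `epsilon`, the `patience` value, and the toy 2-D embeddings are all assumptions made here for the example. In practice the embeddings would come from whatever sentence-embedding model the agent loop already uses.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def stopping_round(
    embeddings: Sequence[Sequence[float]],
    epsilon: float = 0.05,    # hypothetical threshold, not taken from the paper
    patience: int = 2,        # hypothetical patience window
    max_iterations: int = 8,  # the fixed cap still acts as a safety net
) -> int:
    """Return the 1-based round at which the loop would halt.

    The loop halts once `patience` consecutive draft-to-draft distances
    fall below `epsilon`; otherwise it runs to the iteration cap.
    """
    limit = min(len(embeddings), max_iterations)
    calm_rounds = 0
    for i in range(1, limit):
        if cosine_distance(embeddings[i - 1], embeddings[i]) < epsilon:
            calm_rounds += 1
            if calm_rounds >= patience:
                return i + 1
        else:
            calm_rounds = 0  # the draft moved again, so reset the window
    return limit


# Toy embeddings: the draft changes once, then stays put for two rounds.
drafts = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(stopping_round(drafts))
```

The patience window is what keeps a single quiet round from ending the loop prematurely: the counter resets whenever a draft moves by more than `epsilon`.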
The paper makes three contributions. First, a rigorous theoretical foundation: deterministic termination is machine-checked, while convergence is treated as an empirically tested conjecture rather than forced into a Banach contraction. Second, a judge-efficient evaluation protocol that replays all stopping policies over identical drafts and caches LLM-judge calls for low-cost paired comparisons. Third, empirical results on HotpotQA showing that a judge-free semantic stopper reduces operational tokens by 38% relative to a fixed max_iterations cap with no statistically significant quality loss (Delta-IS = -0.004, p=0.81). Counterintuitively, adding a quality gate makes things worse because per-round judging costs dominate. An oracle that picks the best round scores 0.115 Information Score higher than any practical policy (p ~ 4e-11), suggesting the real challenge isn't when to stop but which round is best.
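The replay idea can be illustrated with a small Python sketch: record the drafts once, let every stopping policy pick a round from the same recording, and memoize the judge so each distinct draft is scored at most once. Everything here is hypothetical and chosen only so the example runs offline: the `judge` function is a toy stand-in for an LLM judge, and the policy names and the `replay` and `oracle_score` helpers are not from the paper.

```python
from functools import lru_cache
from typing import Callable, Dict, Tuple

judge_calls = 0  # counts how often the (expensive) judge actually runs


@lru_cache(maxsize=None)
def judge(question: str, draft: str) -> float:
    """Toy stand-in for an LLM judge: scores a draft by its word count."""
    global judge_calls
    judge_calls += 1
    return min(len(draft.split()), 10) / 10.0


def fixed_cap(drafts: Tuple[str, ...]) -> int:
    """Baseline policy: always run to the last recorded round."""
    return len(drafts)


def stop_when_repeated(drafts: Tuple[str, ...]) -> int:
    """Toy 'semantic' policy: stop at the first round that repeats the previous draft."""
    for i in range(1, len(drafts)):
        if drafts[i] == drafts[i - 1]:
            return i + 1
    return len(drafts)


def replay(
    question: str,
    drafts: Tuple[str, ...],
    policies: Dict[str, Callable[[Tuple[str, ...]], int]],
) -> Dict[str, float]:
    """Score the draft each policy would have returned, on identical drafts."""
    return {
        name: judge(question, drafts[policy(drafts) - 1])
        for name, policy in policies.items()
    }


def oracle_score(question: str, drafts: Tuple[str, ...]) -> float:
    """Upper bound: the best score over all recorded rounds."""
    return max(judge(question, d) for d in drafts)


recorded = ("a b", "a b c", "a b c", "a b c d")
scores = replay("q1", recorded, {"fixed_cap": fixed_cap, "repeat": stop_when_repeated})
best = oracle_score("q1", recorded)
print(scores, best, judge_calls)
```

Because every policy is scored on the same recorded drafts, differences between policies are paired comparisons, and the cache means adding another policy or the oracle costs judge calls only for drafts that have not been scored yet.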
- Semantic early-stopping uses cosine distance on embeddings to detect convergence, replacing blind iteration caps.
- Judge-free variant cuts operational tokens by 38% on HotpotQA while maintaining answer quality (p=0.81).
- Quality-gated variant is counter-productive; an oracle selecting the best round outperforms all practical policies by +0.115 Information Score.
Why It Matters
Smarter stopping in AI agent loops could dramatically cut inference costs for complex multi-step tasks.