Pruning to last 5 tool calls + summarization achieved 91.6% complete itemization (vs 71% full history) on GPT-5?

Pruning to last 5 tool calls + summarization achieved 91.6% complete itemization (vs 71% full history) on GPT-5.

Token consumption dropped 63% (from 1.48M to 553k) and runtime fell 60% (from 14.56 to 5.79 hours)?

Token consumption dropped 63% (from 1.48M to 553k) and runtime fell 60% (from 14.56 to 5.79 hours).

Method generalized to Claude Sonnet 4.5, confirming cross-model applicability for enterprise workflows?

Method generalized to Claude Sonnet 4.5, confirming cross-model applicability for enterprise workflows.

Research & Papers

Pruning context boosts LLM agent accuracy by 20% and slashes costs

arXiv cs.AI June 10, 2026

⚡Less context isn't just cheaper—it makes GPT-5 agents 20% more reliable.

Deep Dive

A new arXiv paper from Microsoft researchers introduces efficient context engineering for long-horizon tool-using LLM agents. The team tested four context strategies on GPT-5 and Claude Sonnet 4.5 using a 50-task hotel expense benchmark in Microsoft Dynamics 365 Finance & Operations. The baseline (no user model) achieved only 8.0% complete itemization. Full conversation history improved to 71.0% but consumed 1,480,996 tokens and 14.56 hours per benchmark run. Pruning to the last five tool call/response pairs raised completion to 79.0% while cutting tokens to 535,274 and runtime to 5.39 hours. The best performance came from adding automated summarization after pruning: 91.6% complete itemization, 99.64% average amount itemized, using only 553,374 tokens (63% less than full history) and 5.79 hours (60% faster).

The study also provides confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, and failure analysis across five expense types. Crucially, the approach generalized: Claude Sonnet 4.5 showed similar gains, confirming the robustness of selective retention plus compact summarization for enterprise tool-use workflows. The key insight is that verbose tool responses from enterprise systems cause context overflow, stale-state errors, and high inference cost. By discarding redundant historical interactions and summarizing recent ones, agents become both more accurate and much cheaper to run. This has immediate implications for any organization deploying LLMs for automated financial workflows, customer support, or any long-horizon task involving multiple API calls.

Key Points

Pruning to last 5 tool calls + summarization achieved 91.6% complete itemization (vs 71% full history) on GPT-5.
Token consumption dropped 63% (from 1.48M to 553k) and runtime fell 60% (from 14.56 to 5.79 hours).
Method generalized to Claude Sonnet 4.5, confirming cross-model applicability for enterprise workflows.

Why It Matters

Smarter context management can simultaneously boost AI reliability and slash costs for enterprise automation.

Read Original Article

Pruning context boosts LLM agent accuracy by 20% and slashes costs

Why It Matters

Related Articles

🚀 Stay Ahead in AI