Research & Papers

Agentic AI Workloads Are Decode-Dominated, Not Long-Prompt

New research reveals stateful multi-turn agents shift LLM serving dynamics dramatically.

Deep Dive

A new arXiv paper by Yichao Yuan and colleagues provides the first rigorous characterization of agentic AI workloads, challenging conventional wisdom about how AI agents interact with LLM serving infrastructure. Using an end-to-end tracing infrastructure, the researchers studied ReAct-style agents running on reasoning and non-reasoning configurations of Gemma and Qwen across five standard benchmarks. The findings upend the assumption that agents simply generate long prompts: with effective context caching, most input tokens are reused across turns, so the bottleneck shifts from prompt processing (prefill) to decode latency. This makes agentic serving heavily dependent on long-lived KV-cache state and calls for different optimization strategies than traditional prompt-and-generate workloads.
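To make the caching argument concrete, here is a minimal token-accounting sketch. It is not the paper's methodology: the `Turn` type, the `token_accounting` function, and all token counts are hypothetical, and the model assumes an idealized prefix cache that retains the KV state for everything seen so far in the session.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    new_input_tokens: int  # tokens appended this turn (e.g. tool output); hypothetical
    output_tokens: int     # tokens the model decodes this turn; hypothetical


def token_accounting(system_prompt_tokens, turns, prefix_cache=True):
    """Return (prefill_tokens, decode_tokens) summed over a multi-turn session.

    Assumes an idealized prefix cache: once a token's KV state exists,
    it is never recomputed for the rest of the session.
    """
    context = system_prompt_tokens  # tokens in the conversation so far
    cached = 0                      # tokens whose KV state is already stored
    prefill = 0
    decode = 0
    for turn in turns:
        context += turn.new_input_tokens
        # Without caching, the whole context is re-processed on every turn.
        prefill += (context - cached) if prefix_cache else context
        decode += turn.output_tokens
        context += turn.output_tokens
        if prefix_cache:
            cached = context  # prompt and generated tokens are both retained
    return prefill, decode


turns = [Turn(new_input_tokens=200, output_tokens=300) for _ in range(3)]
with_cache = token_accounting(1000, turns, prefix_cache=True)
without_cache = token_accounting(1000, turns, prefix_cache=False)
print("with cache   :", with_cache)
print("without cache:", without_cache)
```

In this toy session the decode count is identical in both cases, while the prefill count falls sharply once the cache is on, so decoding accounts for a larger share of the remaining work. The gap widens as the number of turns grows, because the uncached prefill cost grows with the full context on every turn.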

The paper also reveals a clear temporal structure in tool usage: early in an agent's execution cycle, tools are primarily used for reading and exploration, while later stages shift toward execution and writing. This pattern suggests that serving systems must dynamically adapt resource allocation as agent sessions progress. The authors conclude that efficient agentic serving must jointly manage repeated model re-entry, persistent context state, and workload-dependent tool behavior. For engineers building production agent systems, the study offers three actionable insights: don't treat agents as just long prompts; invest in KV-cache management and decoding optimization; and expect tooling costs to vary predictably over each agent's lifetime.
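One way to check for this read-then-write pattern in your own agent traces is to split each session's tool calls into phases and compare the share of read-type calls. The sketch below is an illustration only: the function `read_share_by_phase`, the tool names, and the read/write classification are all hypothetical and do not come from the paper.

```python
def read_share_by_phase(tool_calls, read_tools, phases=2):
    """Split an ordered list of tool-call names into equal phases and
    return the fraction of read-type calls in each phase."""
    n = len(tool_calls)
    shares = []
    for p in range(phases):
        start = p * n // phases
        end = (p + 1) * n // phases
        chunk = tool_calls[start:end]
        if not chunk:
            shares.append(0.0)  # empty phase: report zero rather than divide by zero
            continue
        reads = sum(1 for name in chunk if name in read_tools)
        shares.append(reads / len(chunk))
    return shares


# Hypothetical session trace, in call order.
session = [
    "read_file", "search", "read_file", "edit_file",
    "run_tests", "read_file", "write_file", "run_tests",
]
print(read_share_by_phase(session, read_tools={"read_file", "search"}))
```

A read share that drops from the first phase to the second is consistent with the pattern described above. On real traces, the phase count and the mapping of tools to read or write categories are choices you would need to make for your own tool set.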

Key Points
  • Agentic workloads become decode-dominated with effective context caching, shifting the bottleneck from prompt processing to token generation.
  • Tool use follows a predictable temporal pattern: read/explore early, execute/write later in the agent lifecycle.
  • Optimizing agent serving requires managing repeated model re-entry, persistent KV-cache state, and variable tool behavior jointly.

Why It Matters

For AI engineers building agent systems, this paper redefines infrastructure priorities: decode speed and KV-cache management now matter more than prompt throughput.