Research & Papers

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

New research tackles the exploding energy costs of AI agents that use tools and maintain memory across conversations.

Deep Dive

A team of researchers from the University of Michigan and Google has published a paper introducing KAIROS, a system designed to tackle the large and growing power consumption of AI inference, specifically for the emerging class of 'agentic' AI workloads. Unlike single-turn chatbots, AI agents perform complex, multi-step tasks, such as writing code or analyzing data, by using tools (APIs, code execution) and maintaining a persistent memory or 'context' across an entire conversation. The paper argues that existing power-saving techniques, which primarily adjust GPU clock speeds, fail for these agents: lowering the clock slows request completion, so more in-flight agent contexts accumulate in GPU memory, and the resulting pressure leads to 'thrashing,' where the system spends more time managing memory than computing, drastically hurting both performance and efficiency.
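The failure mode can be made concrete with a toy back-of-the-envelope model (our illustration, not from the paper): by Little's law, halving the clock roughly doubles how long each agent's context stays resident, so resident memory grows as frequency drops, and once it spills past GPU capacity a swap penalty erases the power savings. Every constant below is arbitrary.

```python
def power_and_throughput(freq_ghz, arrival_rate=1.0, work_tokens=2000,
                         ctx_gb=2.0, gpu_mem_gb=80.0, swap_penalty=5.0):
    """Toy model of one GPU serving agents (illustrative numbers only)."""
    compute_rate = freq_ghz * 100.0                  # tokens/s, toy scale
    service_time = work_tokens / compute_rate        # seconds per agent
    # Little's law: avg agents resident = arrival rate * service time,
    # so a lower clock keeps more contexts in GPU memory at once.
    resident_gb = arrival_rate * service_time * ctx_gb
    if resident_gb > gpu_mem_gb:
        # Thrashing regime: overflow forces swapping, taxing every token.
        overflow = (resident_gb - gpu_mem_gb) / resident_gb
        compute_rate /= (1.0 + swap_penalty * overflow)
    power_w = 50.0 + 150.0 * freq_ghz ** 2           # toy static + dynamic power
    return power_w, compute_rate
```

With these constants, stepping from 0.5 GHz down to 0.4 GHz pushes resident memory past capacity: throughput collapses faster than power drops, so energy per token gets worse, which is the regime KAIROS is designed to avoid.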

KAIROS rethinks power optimization by treating the agent's evolving context as a first-class signal. The system tracks individual agents, monitoring how their memory footprints grow as they progress through a task and invoke different tools. It then makes joint, adaptive decisions across three levers: per-GPU frequency scaling, how many agent requests run concurrently on a single GPU instance, and intelligent routing of agents across a cluster of GPU servers. This context-aware approach lets KAIROS aggressively save power when an agent's memory usage is low and stable, while proactively steering clear of the thrashing regime. In evaluations across diverse software and data engineering agentic tasks, KAIROS achieved an average power reduction of 27% (up to 39.8%) while consistently meeting predefined latency and performance service-level objectives (SLOs).
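The three levers described above can be sketched as a single planning step. This is a minimal illustration under our own assumptions, not the paper's actual controller: the `AgentState` fields, the 15% headroom margin, and the frequency ladder are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    ctx_gb: float              # current context (e.g. KV cache) footprint
    growth_gb_per_turn: float  # observed growth across recent turns

def plan(agents, gpu_mem_gb=80.0, headroom=0.15,
         freq_levels=(0.8, 1.0, 1.2, 1.4)):
    """Jointly pick (GPU frequency, admitted concurrency, agents to route away)."""
    # Project each agent's next-turn footprint to stay out of the thrashing regime.
    projected = sorted(a.ctx_gb + a.growth_gb_per_turn for a in agents)
    budget = gpu_mem_gb * (1.0 - headroom)
    admitted, used = 0, 0.0
    for need in projected:
        if used + need > budget:
            break           # over budget: remaining agents go to other servers
        used += need
        admitted += 1
    routed_away = len(agents) - admitted
    # Low, stable memory pressure -> lowest clock that fits; pressure near the
    # budget -> step the frequency up so work drains before memory fills.
    pressure = used / budget
    idx = min(int(pressure * len(freq_levels)), len(freq_levels) - 1)
    return freq_levels[idx], admitted, routed_away
```

For example, two small, slow-growing agents map to the lowest clock with nothing routed away, while eight large, fast-growing agents push the planner to a higher clock and spill the overflow to other servers.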

Key Points
  • Targets agentic AI workloads, where AI maintains context and uses tools across multiple turns, a major shift from single-turn LLM serving.
  • Uses agent context growth as a control signal to jointly optimize GPU frequency, concurrency, and cluster routing, avoiding performance-killing memory thrashing.
  • Achieves an average of 27% (up to 39.8%) power reduction while meeting performance targets on practical software and data engineering tasks.

Why It Matters

As AI agents become ubiquitous, this research provides a critical path to managing their prohibitive energy costs and environmental impact at scale.