Research & Papers

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

New system removes the storage bandwidth bottleneck in multi-turn AI agents, nearly doubling inference throughput without violating SLOs.

Deep Dive

A research team led by Yongtong Wu, with 12 co-authors, has published DualPath, an inference system designed to remove the storage bandwidth bottleneck that limits agentic LLM performance. In multi-turn, agentic AI applications, where models like GPT-4 or Claude 3.5 carry out complex, multi-step tasks, performance is increasingly dominated by KV-Cache storage I/O rather than by computation. Traditional disaggregated architectures create a fundamental imbalance: the storage NICs on prefill engines saturate while those on decoding engines sit idle, severely constraining overall system throughput. DualPath introduces an architectural fix for this asymmetry that could significantly improve the efficiency of AI agent deployments.
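To see why storage I/O can dominate, consider a back-of-the-envelope calculation, sketched in Python below. The model shapes and NIC speed are illustrative assumptions, not figures from the paper: for long multi-turn contexts, the KV-Cache reaches tens of gigabytes, and every request must pull it through a prefill engine's storage NIC.

# Back-of-the-envelope sketch (illustrative numbers, not from the paper):
# why KV-Cache loading, not compute, can dominate multi-turn agentic inference.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV-Cache: two tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_elem

# Hypothetical 70B-class model with grouped-query attention (assumed shapes).
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       context_tokens=128_000)  # long multi-turn history

storage_nic_gbps = 100  # assumed 100 Gb/s storage NIC on the prefill engine
load_seconds = cache * 8 / (storage_nic_gbps * 1e9)

print(f"KV-Cache: {cache / 1e9:.1f} GB, "
      f"load over one storage NIC: {load_seconds:.2f} s")
# Roughly 42 GB and 3.4 s per request: the entire load funnels through the
# prefill engine's storage NIC while the decode engines' NICs carry nothing.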

The technical innovation centers on dual-path KV-Cache loading, which adds a storage-to-decode path alongside the traditional storage-to-prefill path. The large KV-Cache (the attention key-value state that stores prior conversation context) can be loaded into decoding engines first and then transferred to prefill engines via RDMA (Remote Direct Memory Access) over the compute network. This second path sidesteps the congested storage NICs without interfering with latency-critical model-execution traffic on the compute network. Combined with a global scheduler that dynamically balances load across prefill and decode engines, DualPath delivered strong results in the authors' evaluations: offline inference throughput improved by up to 1.87x on their in-house system, and online serving throughput rose by an average of 1.96x without violating Service Level Objectives. This is a significant step toward more efficient, scalable AI agent infrastructure.
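The routing decision can be pictured with a small sketch. The Python below is illustrative only, with assumed interfaces, NIC speeds, and a simple completion-time heuristic; it is not the paper's actual scheduling algorithm. The core idea it demonstrates: send each KV-Cache load down whichever path will finish sooner, falling back to a decode engine's idle storage NIC plus an RDMA hop when the prefill path is saturated.

# Minimal sketch of dual-path routing (assumed interfaces, not the paper's
# actual scheduler): pick the less-loaded storage path for each KV-Cache load.

from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    storage_nic_gbps: float  # storage NIC line rate (assumed)
    queued_bytes: int = 0    # bytes already queued on that NIC

    def eta_seconds(self, extra_bytes: int) -> float:
        """Estimated time to drain the current queue plus a new load."""
        return (self.queued_bytes + extra_bytes) * 8 / (self.storage_nic_gbps * 1e9)

def route_kv_load(prefill: Engine, decode: Engine, cache_bytes: int,
                  rdma_gbps: float = 400) -> str:
    """Return the path with the lower estimated completion time.

    The decode path pays an extra RDMA hop over the compute network,
    which is fast and kept separate from storage traffic.
    """
    direct = prefill.eta_seconds(cache_bytes)
    via_decode = decode.eta_seconds(cache_bytes) + cache_bytes * 8 / (rdma_gbps * 1e9)
    if via_decode < direct:
        decode.queued_bytes += cache_bytes
        return "storage->decode->(RDMA)->prefill"
    prefill.queued_bytes += cache_bytes
    return "storage->prefill"

# Usage: a saturated prefill NIC pushes traffic onto the idle decode path.
prefill = Engine("prefill-0", storage_nic_gbps=100, queued_bytes=80_000_000_000)
decode = Engine("decode-0", storage_nic_gbps=100)
print(route_kv_load(prefill, decode, cache_bytes=40_000_000_000))
# Prints "storage->decode->(RDMA)->prefill": ~4.0 s via decode vs ~9.6 s direct.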

Key Points
  • DualPath introduces a storage-to-decode KV-Cache loading path alongside the traditional storage-to-prefill path, resolving the bandwidth imbalance
  • System improves offline inference throughput by up to 1.87x and online serving throughput by 1.96x on average
  • Uses RDMA over compute network to transfer KV-Cache, avoiding network congestion and latency interference

Why It Matters

Enables faster, more scalable AI agents for complex multi-turn tasks while reducing infrastructure costs for providers.