Research & Papers

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

New system bypasses the kernel page cache, cutting edge AI inference latency by up to 42.4%

Deep Dive

Deploying large language models (LLMs) on edge AI systems is increasingly common, but these devices often lack the GPU memory needed to hold the full key-value (KV) cache—a critical component that grows with sequence length. Existing NVMe-based offloading solutions rely on the kernel page cache, which introduces cache thrashing, unpredictable latency, and high software overhead under memory pressure. Enter DUAL-BLADE, a novel framework developed by researchers at Sogang University and Auburn University that takes a smarter, dual-path approach.

DUAL-BLADE dynamically assigns KV tensors to either a traditional page-cache path or a new NVMe-direct path based on real-time memory availability. The NVMe-direct path maps tensors to contiguous logical block address (LBA) regions, bypassing the filesystem entirely for low-overhead storage access. It also incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, boosting inference throughput. In evaluations, DUAL-BLADE reduced prefill latency by up to 33.1% and decode latency by up to 42.4%, while improving SSD utilization by 2.2x across diverse memory budgets. The paper will appear at IEEE ICDCS 2026.
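To make the routing idea concrete, here is a minimal Python sketch of how a dual-path offloader might choose between a buffered (page-cache) write and an O_DIRECT write to a contiguous LBA region on a raw NVMe namespace. The threshold, block size, and helper names are illustrative assumptions for a Linux host, not the paper's implementation.

  # Hypothetical sketch of dual-path KV-tensor offloading (names and
  # thresholds are assumptions, not the authors' code).
  import os
  import mmap

  BLOCK_SIZE = 4096                      # assumed NVMe logical block size
  MEM_PRESSURE_THRESHOLD = 512 * 2**20   # assumed cutoff: 512 MiB free memory

  def available_memory_bytes() -> int:
      """Read MemAvailable from /proc/meminfo (Linux)."""
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith("MemAvailable:"):
                  return int(line.split()[1]) * 1024
      return 0

  def write_page_cache_path(path: str, tensor_bytes: bytes) -> None:
      """Traditional path: buffered write through the kernel page cache."""
      with open(path, "wb") as f:
          f.write(tensor_bytes)

  def write_nvme_direct_path(device: str, lba: int, tensor_bytes: bytes) -> None:
      """NVMe-direct path: O_DIRECT write to a contiguous LBA region on a raw
      namespace, bypassing the filesystem and page cache. O_DIRECT needs a
      block-aligned buffer and length, so stage the data in an mmap buffer."""
      padded = len(tensor_bytes) + (-len(tensor_bytes)) % BLOCK_SIZE
      buf = mmap.mmap(-1, padded)        # anonymous mmap is page-aligned
      buf.write(tensor_bytes)
      fd = os.open(device, os.O_WRONLY | os.O_DIRECT)
      try:
          os.pwrite(fd, buf, lba * BLOCK_SIZE)
      finally:
          os.close(fd)
          buf.close()

  def offload_kv_tensor(tensor_bytes: bytes, path: str, device: str, lba: int) -> str:
      """Route a KV tensor based on current memory availability."""
      if available_memory_bytes() > MEM_PRESSURE_THRESHOLD:
          write_page_cache_path(path, tensor_bytes)
          return "page-cache"
      write_nvme_direct_path(device, lba, tensor_bytes)
      return "nvme-direct"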

Key Points
  • DUAL-BLADE dynamically routes KV tensors to a page-cache path or a direct NVMe path based on memory availability
  • NVMe-direct path bypasses filesystem by mapping tensors to contiguous LBA regions, eliminating kernel overhead
  • Reduces prefill latency by up to 33.1% and decode latency by up to 42.4%, while improving SSD utilization by 2.2x
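The "adaptive pipeline parallelism" described above boils down to keeping the SSD and the GPU busy at the same time. The toy sketch below shows the basic double-buffering pattern: while chunk i is being transferred to the GPU, chunk i+1 is already being read from storage. fetch_from_ssd and copy_to_gpu are hypothetical placeholders, not anything from the paper.

  # Double-buffering sketch: overlap storage reads with GPU transfers.
  from concurrent.futures import ThreadPoolExecutor

  def fetch_from_ssd(chunk_id: int) -> bytes:
      """Placeholder: read one KV-cache chunk from NVMe."""
      return bytes(4096)

  def copy_to_gpu(data: bytes) -> None:
      """Placeholder: issue a DMA transfer of the chunk to GPU memory."""
      pass

  def pipelined_load(num_chunks: int) -> None:
      """Overlap the SSD read of chunk i+1 with the GPU transfer of chunk i."""
      if num_chunks <= 0:
          return
      with ThreadPoolExecutor(max_workers=1) as io_pool:
          next_read = io_pool.submit(fetch_from_ssd, 0)
          for i in range(num_chunks):
              data = next_read.result()                            # wait for chunk i
              if i + 1 < num_chunks:
                  next_read = io_pool.submit(fetch_from_ssd, i + 1)  # prefetch chunk i+1
              copy_to_gpu(data)                                    # overlaps with prefetch

  pipelined_load(8)

In a real system the prefetch depth and chunk size would presumably be tuned to the SSD's queue depth and the GPU's transfer bandwidth; the sketch only illustrates the overlap itself.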

Why It Matters

Enables faster, cheaper LLM inference on edge devices by easing I/O bottlenecks and memory pressure.