DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
Bypassing HBM bottlenecks, DAK fetches offloaded weights and KV caches directly into GPU shared memory.
A new research paper introduces DAK (Direct-Access-Enabled GPU Memory Offloading), a framework designed to overcome GPU memory constraints during LLM inference. Traditional tiered memory systems rely on prefetching data into GPU HBM, which causes contention, wastes capacity, and creates pipeline inefficiencies. DAK instead enables direct GPU access to remote memory by repurposing the Tensor Memory Accelerator (TMA) to asynchronously load offloaded weights and KV caches directly into GPU shared memory (SMEM). This approach eliminates HBM bottlenecks and maximizes aggregate system bandwidth.
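To make the data path concrete, here is a minimal, hypothetical sketch of the idea of reading host-resident data straight into SMEM without staging it in HBM. It is not the paper's code: it uses libcu++'s `cuda::memcpy_async` (which CUDA can lower to TMA bulk copies on Hopper-class GPUs) rather than the paper's TMA repurposing, and the kernel name, tile size, and toy reduction are all illustrative.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cstdio>

namespace cg = cooperative_groups;

constexpr int TILE = 4096;  // floats per shared-memory tile (assumed)

// Reads a tile of host-resident ("offloaded") data straight into SMEM via
// an asynchronous bulk copy, skipping any HBM staging buffer.
__global__ void consume_offloaded(const float* __restrict__ remote,
                                  float* out) {
    auto block = cg::this_thread_block();
    __shared__ alignas(128) float tile[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Cooperative async copy: on sm_90 this can be served by the TMA /
    // bulk-copy hardware; the source bytes never pass through HBM.
    cuda::memcpy_async(block, tile, remote + blockIdx.x * TILE,
                       sizeof(float) * TILE, bar);
    bar.arrive_and_wait();  // phase completes once the copy lands in SMEM

    // Toy reduction so the copied data is actually consumed.
    float acc = 0.f;
    for (int i = block.thread_rank(); i < TILE; i += block.size())
        acc += tile[i];
    atomicAdd(out, acc);
}

int main() {
    float* host_weights;  // pinned + mapped: GPU reads it over PCIe / C2C
    cudaHostAlloc(&host_weights, sizeof(float) * TILE, cudaHostAllocMapped);
    for (int i = 0; i < TILE; ++i) host_weights[i] = 1.f;

    float* dev_out;
    cudaMalloc(&dev_out, sizeof(float));
    cudaMemset(dev_out, 0, sizeof(float));

    consume_offloaded<<<1, 256>>>(host_weights, dev_out);

    float result = 0.f;
    cudaMemcpy(&result, dev_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("sum = %.0f (expect %d)\n", result, TILE);
    return 0;
}
```

Under unified virtual addressing, the pinned-and-mapped host pointer is directly dereferenceable from the GPU, so the async copy pulls the tile over the interconnect into SMEM rather than through an HBM staging buffer, which is the contention DAK is designed to avoid.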
DAK incorporates a greedy algorithm that determines near-optimal per-operation offloading ratios, along with active congestion control and TMA multicast to reduce interconnect bottlenecks and read amplification. Evaluations across GPU architectures and interconnects demonstrate near-optimal bandwidth aggregation, with up to 3x performance gains on NVLink-C2C systems and 1.8x on PCIe systems over state-of-the-art baselines. The framework offers a practical way to run large models on limited GPU memory without sacrificing inference speed.
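The summary does not spell out the paper's greedy algorithm, but the shape of the problem is clear enough to sketch: split each operation's bytes between HBM and remote memory so the two paths stream concurrently and finish together, subject to an HBM capacity budget. The sketch below is a plausible simplification, not the paper's method; the bandwidth figures, operation names, budget, and the uniform cost model are all assumptions.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    const char* name;
    double bytes;           // total bytes this op must read per pass
    double offload = 1.0;   // fraction served from remote memory (1 = all)
};

int main() {
    const double bw_hbm  = 3000e9;  // HBM bandwidth, bytes/s (assumed)
    const double bw_link = 450e9;   // NVLink-C2C bandwidth, bytes/s (assumed)
    double hbm_budget    = 8e9;     // spare HBM capacity for resident bytes

    std::vector<Op> ops = {
        {"attn.kv_cache", 12e9}, {"mlp.up_proj", 6e9}, {"mlp.down_proj", 6e9},
    };

    // Both paths stream concurrently: time = max(resident / bw_hbm,
    // offloaded / bw_link). Equal finish times give the balanced ratio
    //   offload* = bw_link / (bw_hbm + bw_link).
    const double balanced = bw_link / (bw_hbm + bw_link);

    // Greedy pass: with uniform bandwidths every byte saves the same time,
    // so simply fill the HBM budget; a real cost model would rank ops by
    // per-byte time saved and yield different ratios per operation.
    for (auto& op : ops) {
        double want_resident = op.bytes * (1.0 - balanced);
        double granted = std::min(want_resident, hbm_budget);
        hbm_budget -= granted;
        op.offload = 1.0 - granted / op.bytes;
        std::printf("%-14s offload %5.1f%% of %.1f GB\n",
                    op.name, 100 * op.offload, op.bytes / 1e9);
    }
    return 0;
}
```

Once the budget runs out, remaining operations stream entirely from remote memory; operations granted residency approach the balanced ratio, which is what lets the HBM and interconnect bandwidths aggregate rather than serialize.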
- DAK uses TMA to fetch data directly into GPU shared memory, bypassing HBM contention.
- Achieves up to 3x performance gains on NVLink-C2C and 1.8x on PCIe systems vs. baselines.
- Includes a greedy algorithm for near-optimal offloading ratios, plus active congestion control and TMA multicast.
Why It Matters
DAK enables faster LLM inference on GPUs with limited memory, which is crucial for scaling AI deployments.