One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
Dynamic memory allocation recovers the 20-30% latency headroom that static HBM partitioning leaves unrealized in GenRec inference.
Generative recommender (GenRec) inference faces a critical trade-off: GPU HBM must be shared between embedding hot caches (EMB) and KV caches. Current systems optimize each in isolation, overlooking that the optimal allocation ratio can shift by up to 0.35 across workload regimes, which leaves a 20-30% latency improvement on the table. The paper introduces HELM, which jointly manages HBM allocation and request routing. Its Adaptive Memory Allocation uses a three-layer PPO-based controller: a frozen base policy, an online residual adapter, and a burst-aware recovery controller. The controller reaches a decision in just 32 microseconds while staying within 0.024-0.029 of the offline-optimal ratio. The EMB-KV-Aware Scheduler routes requests by weighing KV residency, embedding locality, and node load, avoiding the inefficiencies that arise under heterogeneous allocations.
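The three routing signals can be sketched as a weighted score per node. This is a minimal illustration, assuming a weighted-sum policy; the weights, the `Node`/`route` names, and the exact formula are our assumptions, not HELM's published scheduler.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kv_resident: set     # session IDs whose KV cache already lives on this node
    hot_embeddings: set  # embedding rows cached in this node's EMB hot cache
    load: float          # normalized queue load in [0, 1]

def route(session_id, needed_rows, nodes, w_kv=0.5, w_emb=0.3, w_load=0.2):
    """Pick the node with the best combined score over the three signals
    the paper names: KV residency, embedding locality, and node load.
    The weights here are illustrative, not tuned values from the paper."""
    def score(node):
        kv_hit = 1.0 if session_id in node.kv_resident else 0.0
        locality = len(needed_rows & node.hot_embeddings) / max(len(needed_rows), 1)
        return w_kv * kv_hit + w_emb * locality - w_load * node.load
    return max(nodes, key=score)
```

Under heterogeneous allocations, a node holding a session's KV cache can win even when it is more loaded, because re-routing would force an expensive KV rebuild elsewhere.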
HELM was evaluated on three production-scale datasets on a 32-node A100 cluster. Against the best static baseline, it reduces P99 latency by 24-38% and maintains 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, without sacrificing throughput. The work demonstrates that online, fine-grained memory rebalancing is not only feasible but essential for modern GenRec serving. By keeping H2D refill traffic off the critical path, HELM avoids the P99 SLO violations that plague naive reallocation approaches. This opens the door to more efficient GPU utilization in recommendation systems, which are among the most latency-sensitive AI workloads in production.
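The idea of keeping refills off the critical path can be sketched with a background worker: on a cache miss, answer from the host copy immediately and enqueue the host-to-device refill asynchronously. Everything here is an assumed illustration — the paper does not specify this mechanism at this granularity, and the dicts stand in for the GPU EMB pool and host table (a real system would batch `cudaMemcpyAsync` on a side stream).

```python
import queue
import threading

class AsyncRefiller:
    """Illustrative sketch (names and design are assumptions): serve
    embedding lookups without blocking requests on H2D copies."""

    def __init__(self):
        self.hbm_cache = {}            # stand-in for the HBM hot cache
        self._pending = queue.Queue()  # refill requests for the worker
        threading.Thread(target=self._drain, daemon=True).start()

    def lookup(self, key, host_table):
        # Critical path: hit HBM if possible; on a miss, return the host
        # copy right away and defer the refill to the background worker.
        if key in self.hbm_cache:
            return self.hbm_cache[key]
        self._pending.put((key, host_table[key]))
        return host_table[key]

    def _drain(self):
        # Background worker: performs the (simulated) H2D copies.
        while True:
            key, value = self._pending.get()
            self.hbm_cache[key] = value
            self._pending.task_done()
```

The naive alternative — blocking the request until the row is copied into HBM — is exactly what puts refill latency into the P99 tail during reallocation.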
- PPO-based three-layer controller achieves 32 µs decision latency, staying within 0.024-0.029 of optimal EMB-KV ratio.
- Optimal allocation ratio can shift by up to 0.35 across workload regimes, enabling 24-38% P99 latency reduction.
- Achieves 93.5-99.6% SLO satisfaction on steady, trend, and burst workloads without throughput loss.
Why It Matters
Dynamic HBM partitioning recovers the 20-30% performance that static allocation leaves untapped, enabling cost-efficient real-time GenRec serving at scale.