Research & Papers

HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage

New library challenges GPU memory limits with cache-driven eviction, boosting throughput 2.6-9.4x over indirection-based GPU baselines.

Deep Dive

A research team from academia and industry has introduced HierarchicalKV (HKV), a breakthrough GPU hash table library that fundamentally rethinks how GPUs handle massive datasets. Traditional GPU hash tables operate on a dictionary assumption, preserving every inserted key until insertion fails outright, which wastes scarce High Bandwidth Memory (HBM) when embedding tables exceed GPU capacity. HKV challenges this by making policy-driven eviction a first-class operation with cache semantics: each full-bucket upsert is resolved in place through eviction or admission rejection rather than catastrophic rehashing or failure.
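
To make the cache-semantic contrast concrete, here is a minimal host-side C++ sketch of the idea, not HKV's actual API: the Slot/Bucket names, the bucket size, and the score policy are assumptions chosen for illustration. A full-bucket upsert is resolved in place by updating a matching key, filling an empty slot, evicting the lowest-scored entry, or rejecting admission when the incoming score is too low.

    #include <array>
    #include <cstdint>

    // Illustrative sketch only: a fixed-capacity bucket with cache semantics.
    // Names, sizes, and the score policy are assumptions, not HKV's real API.
    struct Slot {
        uint64_t key   = 0;
        uint64_t score = 0;   // e.g., a timestamp or access-frequency counter
        float    value = 0.f;
        bool     used  = false;
    };

    struct Bucket {
        static constexpr int kCapacity = 128;   // bucket size chosen for illustration
        std::array<Slot, kCapacity> slots;

        // Cache-semantic upsert: update in place, fill an empty slot, or resolve a
        // full bucket by evicting the lowest-scored entry. Returns false only when
        // the incoming score is too low to be admitted; it never rehashes or fails.
        bool upsert(uint64_t key, float value, uint64_t score) {
            int victim = -1;
            uint64_t victim_score = UINT64_MAX;
            for (int i = 0; i < kCapacity; ++i) {
                if (slots[i].used && slots[i].key == key) {   // existing key: update in place
                    slots[i].value = value;
                    slots[i].score = score;
                    return true;
                }
                if (!slots[i].used) {                         // empty slot: ideal victim
                    victim = i;
                    victim_score = 0;
                } else if (slots[i].score < victim_score) {   // track lowest-scored entry
                    victim = i;
                    victim_score = slots[i].score;
                }
            }
            if (score < victim_score) return false;           // admission rejected
            slots[victim] = {key, score, value, true};        // evict/fill in place
            return true;
        }
    };

The real library presumably makes this decision on the GPU with far more parallelism; the sketch only captures the single-bucket eviction-or-rejection logic described above.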

HKV's architecture co-designs four core mechanisms: cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency. It also employs tiered key-value separation as a scaling enabler beyond HBM limitations. Benchmarks on an NVIDIA H100 NVL GPU show HKV sustaining a find throughput of up to 3.9 billion key-value pairs per second (B-KV/s), with less than 5% variation across load factors from 0.50 to 1.00. This represents a 1.4x improvement over WarpCore (the strongest dictionary-semantic baseline) and 2.6-9.4x gains over indirection-based GPU baselines.
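
A second sketch, again hypothetical and only loosely modeled on the mechanisms named above, illustrates how three of them could look in plain C++: keys and scores packed into 128-byte-aligned blocks so that probing a bucket touches whole cache lines, values held in a separate, larger tier (for example host memory) and referenced by pointer, and a score-based choice between two candidate buckets. The alignment, bucket size, hash constant, and field names are all assumptions for exposition.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative layout sketch; alignment, sizes, and the host-memory tier are
    // assumptions for exposition, not HKV's actual data structures.
    constexpr int kBucketSize = 128;

    // Keys and scores packed contiguously and aligned to a 128-byte boundary,
    // so probing a bucket streams whole GPU cache lines of metadata.
    struct alignas(128) KeyBlock {
        uint64_t keys[kBucketSize];
        uint64_t scores[kBucketSize];
    };

    // Tiered key-value separation: the bucket keeps hot metadata (keys, scores)
    // in fast memory and only a pointer per slot to the value payload, which can
    // live in a larger, slower tier (e.g., host memory) once HBM is exhausted.
    struct Bucket {
        KeyBlock meta;
        float*   values[kBucketSize];
    };

    // Score-based dual-bucket selection (assumed policy): a key hashes to two
    // candidate buckets, and the one whose cheapest victim has the lower score
    // is chosen, spreading eviction pressure across the table.
    std::size_t pick_bucket(const std::vector<Bucket>& table, uint64_t key) {
        std::size_t b1 = key % table.size();
        std::size_t b2 = (key * 0x9E3779B97F4A7C15ull) % table.size();  // second hash, illustrative
        auto min_score = [](const Bucket& b) {
            uint64_t m = UINT64_MAX;
            for (uint64_t s : b.meta.scores) m = std::min(m, s);
            return m;
        };
        return min_score(table[b1]) <= min_score(table[b2]) ? b1 : b2;
    }

    int main() {
        std::vector<Bucket> table(1024);                 // toy table: 1024 buckets
        std::size_t b = pick_bucket(table, 42);
        table[b].meta.keys[0]   = 42;
        table[b].meta.scores[0] = 1;
        table[b].values[0]      = new float[16]();       // 16-dim embedding, illustrative
        delete[] table[b].values[0];
        return 0;
    }

The remaining mechanism, triple-group concurrency, concerns how GPU threads cooperate on these operations and is not something a host-side layout sketch can capture.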

Since its open-source release in October 2022, HKV has already been integrated into multiple open-source recommendation frameworks, demonstrating practical utility. The system's cache-semantic approach allows recommendation engines and AI systems to continuously store and update embeddings online without hitting GPU memory walls, enabling real-time personalization at unprecedented scales. This represents a significant advancement for database systems and distributed computing where GPU memory has been a persistent bottleneck.

Key Points
  • Achieves up to 3.9 billion key-value find operations per second on an NVIDIA H100 NVL GPU, with less than 5% variation across load factors from 0.50 to 1.00
  • Delivers a 1.4x throughput improvement over WarpCore (the strongest dictionary-semantic baseline) and 2.6-9.4x gains over indirection-based GPU baselines
  • Open-source library already integrated into multiple recommendation frameworks since October 2022 release

Why It Matters

Enables real-time recommendation systems to handle embedding tables larger than GPU memory, breaking a major scalability bottleneck.