Research & Papers

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

DeepSeek-V3.2 runs 7.5% faster at 100K context using a temporal-correlation trick.

Deep Dive

NVIDIA researchers have introduced Guess-Verify-Refine (GVR), a novel algorithm that accelerates sparse-attention decoding on Blackwell GPUs by 1.88x on average and up to 2.42x per layer. The technique addresses a key bottleneck in long-context LLM serving: the exact Top-K selection that runs once per decode query. GVR exploits temporal correlation across consecutive decode steps—using the previous step's Top-K as a prediction signal—to narrow down valid candidates with just 1-2 global passes, then performs exact selection in shared memory.
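The guess-verify-refine idea can be illustrated with a small NumPy sketch. Everything here is an assumption for illustration—the function name `gvr_topk`, the flat score array, and the single-threaded selection stand in for the actual TensorRT-LLM kernel, which runs the passes in parallel and the final selection in shared memory:

```python
import numpy as np

def gvr_topk(scores, prev_topk, k):
    """Guess a threshold from the previous decode step's Top-K,
    verify it with one global pass, then refine by exact selection
    over the surviving candidates."""
    # Guess: the smallest score among the previous Top-K indices.
    # At least the k previous entries are guaranteed to meet it,
    # so the candidate set always has >= k members.
    thresh = scores[prev_topk].min()

    # Verify: one global pass keeps only entries at or above the guess.
    candidates = np.flatnonzero(scores >= thresh)

    # Refine: exact Top-K restricted to the candidate set. When the
    # previous Top-K is still accurate (temporal correlation), this
    # set is barely larger than k, so the exact step is cheap.
    top = candidates[np.argsort(scores[candidates])[::-1][:k]]
    return np.sort(top)
```

Because the candidate set always contains at least k entries and includes every entry scoring at or above the threshold, the refine step recovers the exact global Top-K (bit-exact with a full selection, absent ties); the speedup comes from the exact sort touching only the small candidate set rather than the whole score array.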

Validated on real DeepSeek-V3.2 workloads integrated into TensorRT-LLM, GVR produces bit-exact outputs while delivering a 7.52% end-to-end time-per-output-token (TPOT) improvement in controlled TEP8 min-latency deployments at 100K context. The gains grow with context length and remain positive even under speculative decoding. The algorithm leverages the Toeplitz/RoPE structure of DeepSeek Sparse Attention (DSA) indexer scores, and while it is currently implemented on Blackwell, the principle may extend to other sparse-attention decoders whose decode-phase Top-K is temporally stable.

Key Points
  • GVR achieves 1.88x average speedup over the production radix-select kernel, up to 2.42x per layer per step
  • Exploits temporal correlation across consecutive decode steps using previous Top-K as prediction signal
  • Validated on DeepSeek-V3.2 in TensorRT-LLM, improving end-to-end TPOT by 7.52% at 100K context

Why It Matters

GVR makes long-context LLM inference faster on Blackwell, reducing latency for real-time AI applications.