
New arXiv paper reveals H100s leave 73% of memory bandwidth on the table for batch-1 LLM decode

arXiv cs.DC June 01, 2026

⚡ On fast GPUs like the H100, launch overhead rather than memory bandwidth limits single-stream AI inference.

Deep Dive

A new paper by Josef Chen, published on arXiv on 28 May 2026, tackles a critical but underappreciated workload: single-stream, batch-1 autoregressive decode for physical AI systems such as robots, autonomous vehicles, and edge copilots. The conventional wisdom is that such inference is memory-bandwidth-bound, so faster memory (like the H100's 3.35 TB/s HBM3) should reduce latency proportionally. Chen tested three 7-8B parameter GQA transformers (including Qwen-2.5-7B) across four NVIDIA GPUs (H100, A100, L40S, and L4) at context lengths from 2K to 16K tokens. The results flipped the narrative: the H100 ran at only 27% of the speed its analytical memory floor would allow, while the much lower-bandwidth L4 hit 81%. For batch-1 workloads, faster memory therefore does not translate proportionally into faster token generation; on fast GPUs the bottleneck shifts to kernel launch overhead.
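The "analytical memory floor" idea can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's model: it assumes a weights-only floor (each decode step streams every weight once, ignoring KV-cache reads), a parameter count of roughly 7.6B for a Qwen-2.5-7B-class model, and an L4 bandwidth of 300 GB/s taken from NVIDIA's datasheet. Only the 3.35 TB/s H100 figure comes from the article, and the function names are illustrative.

```python
# Weights-only memory floor for batch-1 decode (illustrative sketch).
# Assumption: every generated token streams all model weights from GPU
# memory exactly once; KV-cache traffic is ignored.

def memory_floor_ms(n_params: float, bytes_per_param: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Minimum time per decode step in milliseconds."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1e3


def fraction_of_floor(floor_ms: float, measured_ms: float) -> float:
    """How close a measured step time comes to the floor (1.0 = at the floor)."""
    return floor_ms / measured_ms


N_PARAMS = 7.6e9        # assumption: approximate size of a 7-8B model
BF16_BYTES = 2          # bfloat16 stores each weight in 2 bytes
H100_BW = 3.35e12       # bytes/s, HBM3 figure quoted in the article
L4_BW = 300e9           # bytes/s, assumption from NVIDIA's L4 datasheet

h100_floor = memory_floor_ms(N_PARAMS, BF16_BYTES, H100_BW)
l4_floor = memory_floor_ms(N_PARAMS, BF16_BYTES, L4_BW)

print(f"H100 weights-only floor: {h100_floor:.2f} ms/step")
print(f"L4 weights-only floor:   {l4_floor:.2f} ms/step")
print(f"L4 fraction of floor at 62.32 ms/step: "
      f"{fraction_of_floor(l4_floor, 62.32):.2f}")
```

Under these assumptions the L4 floor is about 50.7 ms/step. Dividing it by the 62.32 ms/step bf16 baseline quoted in this article gives roughly 0.81, although the article does not say which model or context length that baseline refers to.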

To isolate the missing term, Chen ran a CUDA Graphs A/B experiment. On the H100 at a 2048-token context, CUDA Graphs sped up decode by 1.259x (95% CI: 1.253–1.267), while on the L4 the same intervention yielded only 1.028x. This indicates that launch-side overhead dominates on fast GPUs but is hidden on slower, bandwidth-bound ones. The deployment implications are significant: memory savings from quantization do not always turn into latency savings. On the L4, bfloat16 inference already sits near the memory floor, yet 4-bit quantization via bnb-nf4 only reduces latency from 62.32 ms/step to 59.36 ms/step, and AutoAWQ+Marlin reaches 45.24 ms/step, both far from the expected 4x reduction. Only GPTQ+ExLlamaV2's Ada-tuned int4 kernels deliver a dramatic improvement, to 17.36 ms/step. For engineers deploying physical AI, simply buying a faster GPU or applying a common quantization method may yield diminishing returns; reducing kernel launch overhead and choosing well-tuned quantized kernels matter more.
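One way to read the CUDA Graphs speedups is as the share of each decode step that the intervention removed. The sketch below uses only the two speedups reported above. Attributing all of the removed time to launch overhead is an assumption made here for illustration, not a claim from the paper, and the helper name is hypothetical.

```python
# Convert a reported speedup into the fraction of baseline step time removed.
# Assumption: all of the time removed by CUDA Graphs was launch-side overhead.

def time_fraction_saved(speedup: float) -> float:
    """Fraction of the baseline step time eliminated by a given speedup."""
    return 1.0 - 1.0 / speedup


# Speedups reported in the article for the 2048-token context A/B test.
cuda_graphs_speedup = {"H100": 1.259, "L4": 1.028}

saved = {gpu: time_fraction_saved(s) for gpu, s in cuda_graphs_speedup.items()}

for gpu, frac in saved.items():
    print(f"{gpu}: CUDA Graphs removed about {frac:.1%} of each decode step")
```

By this reading, roughly a fifth of each H100 decode step disappears with CUDA Graphs, against under 3% on the L4.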

Key Points

H100 reaches only 27% of its analytical memory floor for batch-1 decode, vs 81% on L4, contradicting the assumption that this workload is purely bandwidth-bound.
CUDA Graphs yields a 1.259x speedup on H100 but only 1.028x on L4, isolating launch overhead as the primary bottleneck on fast GPUs.
GPTQ+ExLlamaV2 int4 quantization cuts L4 latency to 17.36 ms/step from a 62.32 ms bf16 baseline, while bnb-nf4 (59.36 ms) and AutoAWQ+Marlin (45.24 ms) fall well short of the expected 4x.
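The quantization latencies reported above can be converted into speedups for a direct comparison with the 4x ideal. The sketch below is plain arithmetic on the article's figures; the 4x target assumes that step time scales with weight bytes when moving from 16-bit to 4-bit weights, and the variable names are illustrative.

```python
# Speedup of each 4-bit method over the bf16 baseline, using the latencies
# reported in the article (ms per decode step on the L4).

BASELINE_MS = 62.32   # bf16 baseline
IDEAL_SPEEDUP = 4.0   # assumption: 16-bit -> 4-bit weights, purely bandwidth-bound

latency_ms = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

speedup = {name: BASELINE_MS / ms for name, ms in latency_ms.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x speedup, {s / IDEAL_SPEEDUP:.0%} of the 4x ideal")
```

Only GPTQ+ExLlamaV2 approaches the ideal, at about 3.59x; bnb-nf4 gains about 5% and AutoAWQ+Marlin about 38%.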

Why It Matters

For physical AI at the edge, picking the wrong GPU or quantization method can waste money and miss latency targets.
