New arXiv paper reveals H100s leave 73% of memory bandwidth on the table for batch-1 LLM decode
On fast GPUs like the H100, kernel launch overhead, not memory bandwidth, limits single-stream AI inference.
A new paper from Josef Chen, published on arXiv on 28 May 2026, tackles a critical but underappreciated workload: single-stream, batch-1 autoregressive decode for physical AI systems like robots, autonomous vehicles, and edge copilots. The conventional wisdom is that such inference is memory-bandwidth-bound, meaning faster memory (like the H100's 3.35 TB/s HBM3) should reduce latency proportionally. Chen tested three 7-8B parameter GQA transformers (including Qwen-2.5-7B) across four NVIDIA GPUs (H100, A100, L40S, and L4) with context lengths from 2K to 16K tokens. The results flipped the narrative: the H100 reached only 27% of its analytical memory floor, the speed it would achieve if decode were limited purely by memory bandwidth, while the older L4 hit 81%. For batch-1 workloads, then, faster memory does not translate proportionally into faster token generation; on fast GPUs the bottleneck shifts to launch overhead from the GPU kernel dispatch mechanism.
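To make the "analytical memory floor" concrete, here is a minimal sketch of a roofline-style estimate: the step time if every weight and KV-cache byte were read exactly once at peak bandwidth. This is not the paper's code. The model shape (7.6e9 bf16 parameters, 28 layers, 4 KV heads, head dimension 128) and the 300 GB/s L4 bandwidth are assumptions chosen for illustration, and the sketch assumes the reported percentages mean floor time divided by measured time.

```python
# Illustrative roofline-style memory floor for batch-1 decode.
# Model-shape numbers are assumptions for a 7-8B GQA model, not paper values.


def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes of K and V cache read per decode step (leading 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def memory_floor_ms(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Minimum step time if all weight and KV bytes are read once at peak bandwidth."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3


def implied_step_ms(floor_ms, fraction_of_floor):
    """Step time implied by running at a given fraction of the floor's speed."""
    return floor_ms / fraction_of_floor


# Assumed shape: 7.6e9 parameters stored in bf16 (2 bytes each).
WEIGHT_BYTES = 7.6e9 * 2
KV_BYTES = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, context_len=2048)

# 3.35 TB/s for the H100 is quoted in the article;
# 300 GB/s for the L4 is an assumed datasheet value.
PEAK_BW = {"H100": 3.35e12, "L4": 300e9}
FRACTION = {"H100": 0.27, "L4": 0.81}  # fractions reported in the article

for gpu, bw in PEAK_BW.items():
    floor = memory_floor_ms(WEIGHT_BYTES, KV_BYTES, bw)
    step = implied_step_ms(floor, FRACTION[gpu])
    print(f"{gpu}: floor {floor:.2f} ms, implied step {step:.2f} ms")
```

Under these assumptions the L4 floor comes out near 51 ms and the implied step time near 63 ms, in the same range as the 62.32 ms/step bf16 figure the article quotes for the L4. The weights dominate the byte count at 2K context; the KV cache term grows linearly with context length.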
To isolate the missing term, Chen ran a CUDA Graphs A/B experiment. On the H100 at 2048 context, CUDA Graphs improved decode latency by 1.259x (95% CI: 1.253–1.267), while on the L4 the same intervention yielded a meager 1.028x improvement. This points to launch-side overhead as the missing term on fast GPUs, one that stays hidden on slower, bandwidth-bound ones. The deployment implications are significant: memory savings from quantization don't always translate into latency savings. On the L4, bfloat16 inference already sits near the memory floor at 62.32 ms/step, yet 4-bit quantization via bnb-nf4 only reduces latency to 59.36 ms/step and AutoAWQ+Marlin to 45.24 ms/step, far from the expected 4x reduction. Only GPTQ+ExLlamaV2's Ada-tuned int4 kernels deliver a dramatic improvement, to 17.36 ms/step. For engineers deploying physical AI, this means that simply buying a faster GPU or applying common quantization may yield diminishing returns; reducing kernel launch overhead and using well-tuned quantized kernels matter just as much.
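The CUDA Graphs numbers can be read as a share of step time removed, and a toy model shows why the same launch cost matters more on a faster GPU. The sketch below is illustrative only: the first part uses the speedups reported in the article, while the 1.0 ms overhead and the 5 ms and 50 ms floors in the second part are hypothetical values, not measurements, and the model assumes step time is simply floor plus a fixed overhead.

```python
def time_fraction_removed(speedup):
    """Share of the original step time eliminated by a given speedup."""
    return 1.0 - 1.0 / speedup


def speedup_from_removing_overhead(floor_ms, overhead_ms):
    """Toy model: step = floor + fixed overhead; removing the overhead leaves the floor."""
    return (floor_ms + overhead_ms) / floor_ms


# CUDA Graphs speedups reported in the article (the H100 figure is at 2048 context).
REPORTED = {"H100": 1.259, "L4": 1.028}
for gpu, s in REPORTED.items():
    print(f"{gpu}: {time_fraction_removed(s):.1%} of step time removed")

# Hypothetical numbers: the same 1.0 ms of launch cost on a short and a long step.
HYPOTHETICAL_OVERHEAD_MS = 1.0
for label, floor_ms in {"fast GPU": 5.0, "slow GPU": 50.0}.items():
    s = speedup_from_removing_overhead(floor_ms, HYPOTHETICAL_OVERHEAD_MS)
    print(f"{label}: {s:.2f}x from removing the fixed overhead")
```

A 1.259x speedup corresponds to removing about a fifth of the H100's step time, against under 3% on the L4. In the toy model, an identical fixed cost gives 1.20x on the short step and 1.02x on the long one, the same qualitative pattern as the reported results.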
- H100 reaches only 27% of its analytical memory floor for batch-1 decode, vs 81% on L4, contradicting the assumption that such decode is uniformly bandwidth-bound.
- CUDA Graphs yields a 1.259x speedup on H100 at 2048 context but only 1.028x on L4, pointing to launch overhead as the bottleneck that emerges on fast GPUs.
- On L4, GPTQ+ExLlamaV2 int4 quantization cuts latency to 17.36 ms/step from a 62.32 ms/step bf16 baseline, while bnb-nf4 (59.36 ms/step) and AutoAWQ+Marlin (45.24 ms/step) deliver far smaller gains.
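The quantization latencies above can be compared against the 4x reduction one might expect from shrinking 16-bit weights to 4 bits. This is a small arithmetic sketch using only the L4 figures quoted in the article; the 4x ideal is a simplification that ignores quantization scales, zero points, and KV-cache traffic.

```python
# L4 latencies in ms/step, as quoted in the article.
BF16_MS = 62.32
QUANT_MS = {"bnb-nf4": 59.36, "AutoAWQ+Marlin": 45.24, "GPTQ+ExLlamaV2": 17.36}

# Simplified ideal: 16-bit to 4-bit weights means 4x fewer weight bytes to read.
IDEAL_SPEEDUP = 4.0


def speedup(baseline_ms, new_ms):
    """Latency speedup of a quantized kernel relative to the bf16 baseline."""
    return baseline_ms / new_ms


for name, ms in QUANT_MS.items():
    s = speedup(BF16_MS, ms)
    print(f"{name}: {s:.2f}x ({s / IDEAL_SPEEDUP:.0%} of the ideal 4x)")
```

The three methods come out at roughly 1.05x, 1.38x, and 3.59x, so only GPTQ+ExLlamaV2 gets close to the idealized bound on this GPU.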
Why It Matters
For physical AI at the edge, picking the wrong GPU or quantization method can waste money and miss latency targets.