Research & Papers

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Research shows naive benchmarks overestimate WebGPU dispatch costs by 20x, with real overhead at 24-71μs per operation.

Deep Dive

A comprehensive study by researcher Jędrzej Maczan reveals critical insights about WebGPU performance for running large language models directly in web browsers. The paper systematically characterizes dispatch overhead (the time cost of sending work to the GPU) across four major GPU vendors (NVIDIA, AMD, Apple, Intel), three GPU backends (including Vulkan and Metal), three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). The most striking finding is that common, naive benchmarking methods dramatically overestimate the true cost of WebGPU's security-focused validation by approximately 20 times.
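
The roughly 20x gap comes down to how a benchmark synchronizes with the GPU. Below is a minimal TypeScript sketch of that distinction, assuming the naive method waits for completion after every single dispatch while the amortized method encodes many sequential dispatches and synchronizes once; the trivial doubling shader, buffer size, and function names are illustrative, not taken from the paper.

```typescript
// Sketch only: assumes a WebGPU-capable browser and @webgpu/types.
const shaderSource = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    data[id.x] = data[id.x] * 2.0;
  }
`;

async function setup() {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter!.requestDevice();
  const module = device.createShaderModule({ code: shaderSource });
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module, entryPoint: "main" },
  });
  const buffer = device.createBuffer({ size: 256 * 4, usage: GPUBufferUsage.STORAGE });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer } }],
  });
  return { device, pipeline, bindGroup };
}

async function measure(n: number) {
  const { device, pipeline, bindGroup } = await setup();

  // Naive: wait for GPU completion after every dispatch. The per-dispatch
  // submit + sync round trip swamps the actual API overhead being measured.
  let t0 = performance.now();
  for (let i = 0; i < n; i++) {
    const enc = device.createCommandEncoder();
    const pass = enc.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(1);
    pass.end();
    device.queue.submit([enc.finish()]);
    await device.queue.onSubmittedWorkDone(); // per-dispatch sync inflates the estimate
  }
  const naiveUs = ((performance.now() - t0) * 1000) / n;

  // Amortized: encode n sequential dispatches, synchronize once, divide by n.
  t0 = performance.now();
  const enc = device.createCommandEncoder();
  const pass = enc.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  for (let i = 0; i < n; i++) pass.dispatchWorkgroups(1);
  pass.end();
  device.queue.submit([enc.finish()]);
  await device.queue.onSubmittedWorkDone();
  const amortizedUs = ((performance.now() - t0) * 1000) / n;

  console.log(`naive: ${naiveUs.toFixed(1)} us/op, amortized: ${amortizedUs.toFixed(1)} us/op`);
}
```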

Using a novel sequential-dispatch methodology, the research pinpoints the real per-dispatch API overhead at 24-36 microseconds on Vulkan and 32-71 microseconds on Metal. When Python's own processing costs are included, total per-operation overhead rises to about 95 microseconds. This distinction proves crucial for optimization: on Vulkan, fusing multiple GPU operations into a single dispatch improves throughput by 53%, while the same technique provides no benefit on CUDA, confirming that per-operation overhead, not kernel quality, is the primary performance bottleneck in WebGPU.
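
What fusion buys is the elimination of whole dispatches, each of which carries the 24-36 microsecond Vulkan overhead measured above. A hedged WGSL sketch of the idea follows; the scale-then-bias operations are illustrative choices, not kernels from the paper.

```typescript
// Illustrative WGSL: two elementwise ops, unfused vs. fused.
// Each eliminated dispatch saves one round of per-dispatch overhead.

// Unfused path: two pipelines, two dispatches, overhead paid twice.
const scaleWgsl = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> x: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    x[id.x] = x[id.x] * 2.0;
  }
`;

const biasWgsl = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> x: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    x[id.x] = x[id.x] + 1.0;
  }
`;

// Fused path: one shader computes both steps, overhead paid once.
const fusedWgsl = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> x: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    x[id.x] = x[id.x] * 2.0 + 1.0;
  }
`;
```

On CUDA, where launch overhead is far lower, collapsing the two dispatches saves almost nothing, which is consistent with the paper's finding that fusion only pays off where per-dispatch cost is high.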

The study also introduces torch-webgpu, an experimental out-of-tree PyTorch backend and compiler that allows AI models to run via WebGPU. On the reference test platform, this backend achieved 11-12% of the performance of NVIDIA's native CUDA platform. Performance analysis showed that at batch size 1 with current dispatch-heavy pipelines, per-operation overhead dominates total execution time regardless of kernel quality. All code, benchmarks, and raw data from the study are open source.
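
A back-of-envelope calculation makes the batch-size-1 point concrete. In the sketch below, only the ~95 microsecond per-operation figure comes from the study; the dispatch count and average kernel time are hypothetical placeholders.

```typescript
// Hypothetical illustration: only perOpOverheadUs (~95 us) is from the study;
// dispatchesPerToken and kernelTimePerOpUs are invented placeholder values.
const perOpOverheadUs = 95;      // measured total per-op overhead (incl. Python costs)
const dispatchesPerToken = 500;  // hypothetical: dispatches issued per generated token
const kernelTimePerOpUs = 10;    // hypothetical: average GPU kernel execution time

const overheadUs = perOpOverheadUs * dispatchesPerToken;
const computeUs = kernelTimePerOpUs * dispatchesPerToken;
const overheadShare = overheadUs / (overheadUs + computeUs);

// With these numbers, ~90% of wall-clock time is overhead: faster kernels
// barely move the total, which is why fusion (fewer dispatches) helps instead.
console.log(`overhead share: ${(overheadShare * 100).toFixed(0)}%`);
```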

Key Points
  • Naive benchmarks overestimate WebGPU dispatch overhead by ~20x, with real cost at 24-71μs per operation
  • Kernel fusion improves Vulkan throughput by 53%, confirming overhead as primary performance bottleneck
  • Experimental torch-webgpu PyTorch backend achieves 11-12% of CUDA performance on reference hardware

Why It Matters

Enables more accurate performance predictions for browser-based AI applications and guides optimization efforts for WebGPU frameworks.