LlamaWeb boosts browser LLM speed 45-69%, cuts memory 29-33%
New WebGPU backend runs large language models directly in your browser with dramatic efficiency gains.
Researchers built LlamaWeb, a WebGPU backend for an existing LLM framework that enables memory-efficient LLM inference in browsers. It reduces memory overhead by 29-33% across combinations of device, browser, and operating system, and raises decode throughput by 45-69% on four GPUs from different vendors. Evaluated on 10 models and 4 weight formats, it relies on static memory planning to cut memory use and on templated GPU kernels to support multiple quantization formats. LlamaWeb is also competitive with vendor-specific backends and outperforms them on some devices.
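To make "static memory planning" concrete, here is a minimal TypeScript sketch of the general idea, not LlamaWeb's actual planner. It assumes every tensor's size and lifetime are known before inference starts, so tensors that are never live at the same time can share bytes in one pre-sized arena. The names `TensorSpec`, `planArena`, and the demo tensors are hypothetical, and real WebGPU buffers would also need offset alignment, which this sketch ignores.

```typescript
// Hypothetical description of a tensor whose size and lifetime are known up front.
interface TensorSpec {
  name: string;
  bytes: number;
  firstUse: number; // index of the first op that touches the tensor
  lastUse: number;  // index of the last op that touches the tensor
}

interface Placement {
  name: string;
  offset: number;
  bytes: number;
}

// Two tensors conflict only if their [firstUse, lastUse] intervals intersect.
function lifetimesOverlap(a: TensorSpec, b: TensorSpec): boolean {
  return a.firstUse <= b.lastUse && b.firstUse <= a.lastUse;
}

// Greedy first-fit planner: largest tensors first, each placed at the lowest
// offset that does not collide with an already placed, simultaneously live tensor.
function planArena(tensors: TensorSpec[]): { placements: Placement[]; arenaBytes: number } {
  const order = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const placed: { spec: TensorSpec; offset: number }[] = [];

  for (const t of order) {
    // Byte ranges already taken by tensors that are live at the same time as t.
    const busy = placed
      .filter((p) => lifetimesOverlap(p.spec, t))
      .map((p) => ({ start: p.offset, end: p.offset + p.spec.bytes }))
      .sort((a, b) => a.start - b.start);

    let offset = 0;
    for (const r of busy) {
      if (offset + t.bytes <= r.start) break; // t fits in the gap before r
      offset = Math.max(offset, r.end);       // otherwise skip past r
    }
    placed.push({ spec: t, offset });
  }

  const arenaBytes = placed.reduce((max, p) => Math.max(max, p.offset + p.spec.bytes), 0);
  const placements = placed.map((p) => ({ name: p.spec.name, offset: p.offset, bytes: p.spec.bytes }));
  return { placements, arenaBytes };
}

// Illustrative (made-up) workload: three short-lived activations and one long-lived buffer.
const demo: TensorSpec[] = [
  { name: "act0", bytes: 400, firstUse: 0, lastUse: 1 },
  { name: "act1", bytes: 400, firstUse: 1, lastUse: 2 },
  { name: "act2", bytes: 400, firstUse: 2, lastUse: 3 },
  { name: "state", bytes: 100, firstUse: 0, lastUse: 3 },
];
const plan = planArena(demo);
const naiveBytes = demo.reduce((sum, t) => sum + t.bytes, 0);
console.log(`planned arena: ${plan.arenaBytes} bytes, one buffer per tensor: ${naiveBytes} bytes`);
```

In this toy workload the planner needs a 900-byte arena, versus 1300 bytes if every tensor had its own buffer, because `act2` reuses the space `act0` no longer needs. The sizes are invented for illustration and say nothing about LlamaWeb's measured savings.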
- 29-33% less memory usage across 16 devices from 8 vendors compared to existing browser LLM frameworks
- 45-69% higher decode throughput on four GPUs from different vendors
- Supports 4 model weight formats via templated GPU kernels, easily extensible to new quantization schemes
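One way to picture the templated-kernel approach is a single WGSL shader template with a slot for format-specific dequantization code. The TypeScript sketch below is an assumption about the general technique, not LlamaWeb's code. The format names, block size, bit layouts, and placeholder syntax (`{{...}}`) are all hypothetical, and the WGSL is illustrative and has not been run against a real device.

```typescript
// Hypothetical description of a quantization format: only the dequantization body differs.
interface QuantFormat {
  name: string;
  blockSize: number;   // number of weights sharing one scale (assumed layout)
  dequantBody: string; // WGSL statements that return the i-th weight as f32
}

// Shared kernel skeleton; {{...}} markers are this sketch's own placeholder syntax.
const KERNEL_TEMPLATE = `
@group(0) @binding(0) var<storage, read> packed_weights: array<u32>;
@group(0) @binding(1) var<storage, read> block_scales: array<f32>;
@group(0) @binding(2) var<storage, read_write> out_values: array<f32>;

fn dequant(i: u32) -> f32 {
{{DEQUANT_BODY}}
}

@compute @workgroup_size({{WORKGROUP_SIZE}})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&out_values)) { return; }
  out_values[i] = dequant(i);
}
`;

// Two made-up formats: 8-bit and 4-bit values stored with a zero-point offset.
const FORMATS: QuantFormat[] = [
  {
    name: "demo_int8",
    blockSize: 32,
    dequantBody: `  let word = packed_weights[i / 4u];
  let q = i32((word >> ((i % 4u) * 8u)) & 0xFFu) - 128;
  return f32(q) * block_scales[i / {{BLOCK_SIZE}}u];`,
  },
  {
    name: "demo_int4",
    blockSize: 32,
    dequantBody: `  let word = packed_weights[i / 8u];
  let q = i32((word >> ((i % 8u) * 4u)) & 0xFu) - 8;
  return f32(q) * block_scales[i / {{BLOCK_SIZE}}u];`,
  },
];

// Fill the template for one format. Function replacers avoid "$" being treated specially.
function renderKernel(format: QuantFormat, workgroupSize: number = 64): string {
  const body = format.dequantBody.split("{{BLOCK_SIZE}}").join(String(format.blockSize));
  return KERNEL_TEMPLATE
    .replace("{{DEQUANT_BODY}}", () => body)
    .replace("{{WORKGROUP_SIZE}}", () => String(workgroupSize));
}

// Generate one shader source string per format.
const kernels = new Map<string, string>(FORMATS.map((f) => [f.name, renderKernel(f)]));
console.log([...kernels.keys()]);
```

Under this design, adding a new quantization scheme means adding one entry to `FORMATS` while the bindings, bounds check, and dispatch logic stay shared. In a browser, each generated string could then be passed to `GPUDevice.createShaderModule({ code })`.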
Why It Matters
Enables powerful, private AI applications directly in the browser without server costs or data leaks.