Benchmarks of Gemma4 and several other local models on a Raspberry Pi 5
A ~$100 Raspberry Pi 5 with SSD swap runs 30B-parameter models at usable speeds thanks to PCIe Gen3 optimization.
A recent viral benchmark demonstrates how capable a standard Raspberry Pi 5 can be at running local AI models. Using a 16 GB Pi 5 with the official M.2 HAT and a 1 TB SSD configured for PCIe Gen3, the user achieved read speeds of ~800 MB/sec. This hardware, costing roughly $100 for the Pi plus storage, was then used to benchmark a wide range of quantized GGUF models via llama.cpp, testing both prompt processing (pp512) and text generation (tg128) at 0 and 32k context lengths.
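As a rough illustration of that methodology, here is a minimal Python sketch that drives llama-bench (the benchmarking tool bundled with llama.cpp) over a list of GGUF files and collects the pp512/tg128 results. The model filenames are placeholders and the JSON field names are assumptions based on llama-bench's JSON output; adjust both for the actual build and models used.

```python
# Sketch: automate pp512 / tg128 benchmark runs with llama-bench and
# collect tokens/sec results. Model paths are placeholders, not the
# files used in the original post.
import json
import subprocess

MODELS = [
    "models/gemma-q8_0.gguf",         # placeholder filenames
    "models/qwen3-coder-30b-q4.gguf",
]

def bench(model: str) -> list[dict]:
    """Run the prompt-processing (512 tokens) and generation (128 tokens)
    tests for one GGUF model and return the parsed JSON results."""
    out = subprocess.run(
        ["llama-bench", "-m", model, "-p", "512", "-n", "128", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

if __name__ == "__main__":
    for model in MODELS:
        for result in bench(model):
            # Field names assumed from llama-bench's JSON output format.
            print(model, result.get("n_prompt"), result.get("n_gen"), result.get("avg_ts"))
```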
The standout performer was Google's Gemma4 2B-it model in Q8_0 quantization (4.69 GiB), which processed prompts at a blazing 41.76 tokens per second. Even larger models like the 30B-parameter GLM-4.7-Flash and Qwen3-Coder 30B were runnable, albeit at slower speeds. The key finding: simply enabling PCIe Gen3 in the boot config gave models running from SSD swap a 1.5x to 2x speedup over the user's previous USB3 setup, making inference on models far larger than the Pi's RAM practically feasible.
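For reference, enabling PCIe Gen3 on Raspberry Pi OS is normally a one-line addition to the boot config (`dtparam=pciex1_gen=3`), followed by a reboot. Below is a minimal Python sketch of that change, assuming the standard Bookworm config path; it is not taken from the original post.

```python
# Sketch: ensure the PCIe Gen 3 device-tree parameter is present in the
# Raspberry Pi boot config. Run as root; reboot for the change to apply.
from pathlib import Path

CONFIG = Path("/boot/firmware/config.txt")  # boot config on Raspberry Pi OS Bookworm
PARAM = "dtparam=pciex1_gen=3"              # request PCIe Gen 3 on the M.2 HAT link

def enable_pcie_gen3() -> None:
    text = CONFIG.read_text()
    if PARAM in text:
        print("PCIe Gen 3 already enabled")
        return
    CONFIG.write_text(text.rstrip("\n") + f"\n{PARAM}\n")
    print(f"Added {PARAM} - reboot to apply")

if __name__ == "__main__":
    enable_pcie_gen3()
```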
- Gemma4 2B-it (Q8_0) led benchmarks with 41.76 tokens/sec prompt processing speed on a Pi 5.
- Enabling PCIe Gen3 on the official M.2 HAT doubled SSD read speed to ~800 MB/sec, boosting inference 1.5x-2x.
- The setup successfully ran models up to 30B parameters (like GLM-4.7-Flash) using llama.cpp and SSD swap space.
Why It Matters
These results show that ultra-low-cost, widely accessible hardware can now run capable local AI models, democratizing development and private inference.