Benchmarks of Gemma4 and several other local models on a Raspberry Pi 5
A ~$100 Raspberry Pi 5 with SSD swap runs 30B-parameter models at usable speeds thanks to PCIe Gen3 optimization.
A recent viral benchmark demonstrates how capable a standard Raspberry Pi 5 can be at running local AI models. Using a 16 GB Pi 5 with the official M.2 HAT and a 1 TB SSD configured for PCIe Gen3, the user achieved read speeds of ~800 MB/sec. This hardware, costing roughly $100 for the Pi plus storage, was then used to benchmark a wide range of quantized GGUF models via llama.cpp, testing both prompt processing (pp512) and text generation (tg128) at 0 and 32k context lengths.
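As a rough illustration of that methodology, here is a minimal Python sketch that drives llama-bench (the benchmarking tool bundled with llama.cpp) over a list of GGUF files and collects the pp512/tg128 results. The model filenames are placeholders and the JSON field names are assumptions based on llama-bench's JSON output; adjust both for the actual build and models used.

```python
# Sketch: automate pp512 / tg128 benchmark runs with llama-bench and
# collect tokens/sec results. Model paths are placeholders, not the
# files used in the original post.
import json
import subprocess

MODELS = [
    "models/gemma-q8_0.gguf",         # placeholder filenames
    "models/qwen3-coder-30b-q4.gguf",
]

def bench(model: str) -> list[dict]:
    """Run the prompt-processing (512 tokens) and generation (128 tokens)
    tests for one GGUF model and return the parsed JSON results."""
    out = subprocess.run(
        ["llama-bench", "-m", model, "-p", "512", "-n", "128", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

if __name__ == "__main__":
    for model in MODELS:
        for result in bench(model):
            # Field names assumed from llama-bench's JSON output format.
            print(model, result.get("n_prompt"), result.get("n_gen"), result.get("avg_ts"))
```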
The standout performer was Google's Gemma4 2B-it model in Q8_0 quantization (4.69 GiB), which processed prompts at a blazing 41.76 tokens per second. Even larger models like the 30B-parameter GLM-4.7-Flash and Qwen3-Coder 30B were runnable, albeit at slower speeds. The key finding: simply enabling PCIe Gen3 in the boot config gave models running from SSD swap a 1.5x to 2x speedup over the user's previous USB3 setup, making inference on models far larger than the Pi's RAM practically feasible.
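For reference, enabling PCIe Gen3 on Raspberry Pi OS is normally a one-line addition to the boot config (`dtparam=pciex1_gen=3`), followed by a reboot. Below is a minimal Python sketch of that change, assuming the standard Bookworm config path; it is not taken from the original post.

```python
# Sketch: ensure the PCIe Gen 3 device-tree parameter is present in the
# Raspberry Pi boot config. Run as root; reboot for the change to apply.
from pathlib import Path

CONFIG = Path("/boot/firmware/config.txt")  # boot config on Raspberry Pi OS Bookworm
PARAM = "dtparam=pciex1_gen=3"              # request PCIe Gen 3 on the M.2 HAT link

def enable_pcie_gen3() -> None:
    text = CONFIG.read_text()
    if PARAM in text:
        print("PCIe Gen 3 already enabled")
        return
    CONFIG.write_text(text.rstrip("\n") + f"\n{PARAM}\n")
    print(f"Added {PARAM} - reboot to apply")

if __name__ == "__main__":
    enable_pcie_gen3()
```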
- Gemma4 2B-it (Q8_0) led benchmarks with 41.76 tokens/sec prompt processing speed on a Pi 5.
- Enabling PCIe Gen3 on the official M.2 HAT doubled SSD read speed to ~800 MB/sec, boosting inference 1.5x-2x.
- The setup successfully ran models up to 30B parameters (like GLM-4.7-Flash) using llama.cpp and SSD swap space.
Why It Matters
These results show that ultra-low-cost, widely accessible hardware can now run capable local AI models, democratizing development and private inference.