Update on Qwen 3.5 35B A3B on Raspberry Pi 5
A modified llama.cpp setup achieves 3.5 t/s for a 35B-parameter model on an $80 computer.
A developer has successfully optimized and run Alibaba's Qwen 3.5 35B A3B model on a Raspberry Pi 5, an $80 single-board computer. By modifying the llama.cpp inference engine and using a heavily quantized 2-bit version of the model, the setup achieves a generation speed of 3.5 tokens per second on a 16GB Pi 5. This is a significant result for on-device AI, showing that models with tens of billions of parameters can operate outside the cloud on extremely constrained hardware.
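A quick back-of-envelope check makes the memory math plausible. The effective bit rate for a Q2_K_XL dynamic quant is an assumption here (these quants mix precisions, so the true rate sits somewhat above the nominal 2 bits per weight):

```python
# Rough estimate of whether a 2-bit-quantized 35B model fits in 16 GB.
# The ~2.7 bits/weight figure is an assumed effective rate for Q2_K_XL,
# not a published number for this specific GGUF.
PARAMS = 35e9
BITS_PER_WEIGHT = 2.7  # assumption: effective rate of the dynamic quant

weight_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
headroom_gb = 16 - weight_gb  # left over for OS, KV cache, and buffers

print(f"Approx. weight footprint: {weight_gb:.1f} GB")
print(f"Headroom on a 16GB Pi 5:  {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone land around 12 GB, which explains why the 16GB board (rather than the 8GB variant) is the relevant target.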
The project involved custom tweaks to llama.cpp, blending code from the original repository and the 'ik_llama' fork, along with experimentation with quantization parameters and prompt caching. The specific model used is the 'Qwen3.5-35B-A3B-UD-Q2_K_XL.gguf' from Hugging Face. While prompt processing remains slow at ~50 seconds per 1,000 tokens, the developer is testing further optimizations such as asymmetric KV cache quantization to boost performance. For comparison, a much smaller Qwen3.5 2B model in 4-bit quantization runs at a brisk 8 t/s on the same hardware.
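The appeal of asymmetric KV cache quantization is that keys tend to be more quantization-sensitive than values, so keeping keys at higher precision (e.g. 8-bit) while compressing values harder (e.g. 4-bit) saves memory with less quality loss than quantizing both. A minimal sketch of the savings, using assumed architecture numbers (the real layer and head counts for Qwen 3.5 35B A3B may differ):

```python
# Memory saved by an asymmetric KV cache, under assumed architecture
# parameters -- these are illustrative, not the model's real config.
N_LAYERS = 48     # assumption
N_KV_HEADS = 8    # assumption (grouped-query attention)
HEAD_DIM = 128    # assumption
CTX = 8192        # context length for the estimate

def kv_cache_bytes(k_bits, v_bits):
    """Total bytes for the K and V caches at given per-element bit widths."""
    elems_per_token = N_LAYERS * N_KV_HEADS * HEAD_DIM  # per cache (K or V)
    return CTX * elems_per_token * (k_bits + v_bits) / 8

f16_cache = kv_cache_bytes(16, 16)   # baseline: f16 keys and values
asym_cache = kv_cache_bytes(8, 4)    # e.g. 8-bit keys, 4-bit values

print(f"f16 cache:        {f16_cache / 1e6:.0f} MB")
print(f"asymmetric cache: {asym_cache / 1e6:.0f} MB "
      f"({asym_cache / f16_cache:.0%} of baseline)")
```

Whatever the exact head counts, the ratio holds: 8-bit keys plus 4-bit values use 12/32 = 37.5% of the f16 baseline, freeing memory that matters on a 16GB board already mostly occupied by weights.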
This work pushes the boundaries of efficient inference, showing that with aggressive quantization and software optimization, the gap between cutting-edge AI and accessible, low-cost hardware is rapidly closing. It enables complex AI tasks like vision-and-language processing to be performed entirely locally on a device the size of a credit card.
- Runs Alibaba's 35B parameter Qwen 3.5 A3B model on a Raspberry Pi 5 using a modified llama.cpp engine.
- Achieves 3.5 tokens/second generation speed with a 2-bit quantized model, with prompt processing at ~50s/1k tokens.
- Demonstrates the feasibility of powerful, multi-modal AI running locally on ultra-low-cost, energy-efficient hardware.
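The figures above imply that end-to-end latency is dominated by prefill for long prompts. A rough estimate for a hypothetical workload (the 1,000-token prompt and 256-token reply are illustrative, not reported benchmarks):

```python
# End-to-end latency estimate from the reported figures: ~50 s per
# 1,000 prompt tokens (i.e. ~20 t/s prefill) and 3.5 t/s generation.
# The workload sizes below are hypothetical examples.
PREFILL_TPS = 1000 / 50   # implied by ~50 s per 1k tokens
GEN_TPS = 3.5

prompt_tokens, reply_tokens = 1000, 256
prefill_s = prompt_tokens / PREFILL_TPS
gen_s = reply_tokens / GEN_TPS

print(f"prefill: {prefill_s:.0f} s, generation: {gen_s:.0f} s, "
      f"total: {prefill_s + gen_s:.0f} s")
```

This is why prompt caching (reusing prefill work across turns) is worth the experimentation mentioned above: for interactive use, shaving the prefill cost matters more than raising the 3.5 t/s generation rate.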
Why It Matters
Enables complex AI applications to run offline on cheap, portable devices, reducing cost and latency for edge computing.