Open Source

My NAS runs an 80B LLM at 18 tok/s on its iGPU. No discrete GPU. Still optimizing.

This home-built NAS setup delivers fast local AI inference without an expensive graphics card...

Deep Dive

A hobbyist achieved 18 tokens/second inference on an 80-billion-parameter Qwen3-Coder-Next model using only integrated graphics on a custom NAS. The system pairs a Ryzen AI 9 HX PRO CPU with 96GB of RAM, running llama.cpp with the Vulkan backend and flash attention enabled. Tuning took throughput from 3 tok/s (CPU-only) to 18 tok/s, demonstrating that high-end local LLMs can run on surprisingly affordable, multi-purpose hardware without a discrete GPU.
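For readers who want to try a similar setup, here is a minimal sketch using the llama-cpp-python bindings. It assumes the bindings were compiled against a Vulkan-enabled llama.cpp build (e.g. installing with CMAKE_ARGS="-DGGML_VULKAN=ON"; the exact flag name varies by llama.cpp version), and the model path, quantization, and context size are placeholders, not details from the original post:

```python
# Minimal sketch: serving a large GGUF model on an iGPU via the Vulkan backend.
# Assumes llama-cpp-python was built with Vulkan support, e.g.:
#   CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-next-80b-q4.gguf",  # hypothetical path to a quantized GGUF
    n_gpu_layers=-1,  # offload all layers; the iGPU shares the system's 96GB RAM
    flash_attn=True,  # flash attention, as used in the setup described above
    n_ctx=8192,       # context window; size to fit the remaining shared memory
)

out = llm("Write a function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```

The key design point is that an iGPU draws model weights from ordinary system RAM, so a large quantized model that would overflow a discrete card's VRAM can still be fully offloaded, with flash attention and backend tuning recovering much of the lost speed.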

Why It Matters

This dramatically lowers the cost barrier for running massive local AI models, making advanced inference accessible without $1,000+ GPUs.