Decoupled Attention from Weights - Gemma 4 26B
Split attention from weights to run 26B models on modest hardware.
Deep Dive
The idea is to split the attention state (a couple of GB of KV cache) onto your local machine while the model weights are served from another machine on your network (like a cheap Xeon box), which largely sidesteps the memory-scale problem of running large LLMs locally. Functional code is available in the larql repo.
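A minimal sketch of the split, under stated assumptions: the class names, the in-process "server", and the toy dimensions are illustrative only and are not the larql repo's actual API. The weight-heavy projections run where the weights live; only activations cross the boundary, while the KV cache and the attention softmax stay local.

```python
# Hypothetical sketch of the decoupled design (not larql's real interface):
# the weight-bound matmuls run on the remote box, the KV cache stays local.
import numpy as np

D_MODEL, N_HEADS = 64, 4          # toy dimensions; a real 26B model is far larger
D_HEAD = D_MODEL // N_HEADS

class RemoteWeightServer:
    """Stands in for the Xeon box holding the weights.

    In the real system this would be an RPC/HTTP call; here the matmuls are
    simulated in-process so the sketch runs on its own.
    """
    def __init__(self, rng):
        self.wq = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
        self.wk = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
        self.wv = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

    def project_qkv(self, x):
        # Weight-bound work: only activations cross the wire, never the weights.
        return x @ self.wq, x @ self.wk, x @ self.wv

class LocalAttention:
    """Keeps the per-token KV cache (the 'couple of GB' part) in local memory."""
    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, q, k, v):
        self.k_cache.append(k)
        self.v_cache.append(v)
        keys = np.stack(self.k_cache)              # (seq, d_model)
        vals = np.stack(self.v_cache)
        # Split into heads and run scaled dot-product attention locally.
        qh = q.reshape(N_HEADS, D_HEAD)
        kh = keys.reshape(-1, N_HEADS, D_HEAD)
        vh = vals.reshape(-1, N_HEADS, D_HEAD)
        scores = np.einsum("hd,shd->hs", qh, kh) / np.sqrt(D_HEAD)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out = np.einsum("hs,shd->hd", weights, vh)
        return out.reshape(D_MODEL)

rng = np.random.default_rng(0)
server = RemoteWeightServer(rng)   # lives on the cheap Xeon in the real setup
attn = LocalAttention()            # lives on your machine

for _ in range(5):                 # decode a few toy tokens
    x = rng.standard_normal(D_MODEL)
    q, k, v = server.project_qkv(x)    # remote: weight-heavy projections
    y = attn.step(q, k, v)             # local: softmax over the cached K/V
print("attention output shape:", y.shape)
```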
Key Points
- Decoupled attention splits the attention state (the KV cache, a few GB) from the model weights, enabling large models on limited VRAM (rough numbers in the sketch after this list).
- Weights are served from a cheap Xeon CPU machine while attention runs locally, reducing hardware costs dramatically.
- Functional code is available in the larql GitHub repo, with a video overview explaining the architecture.
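To see why the local side only needs a few GB, here is a back-of-envelope estimate; the hyperparameters are rough assumptions for a ~26B decoder with grouped-query attention, not the model's published config.

```python
# Back-of-envelope memory split. All hyperparameters below are illustrative
# assumptions, not official figures for any specific 26B model.
N_PARAMS   = 26e9      # total weight count
N_LAYERS   = 46
N_KV_HEADS = 16
HEAD_DIM   = 128
SEQ_LEN    = 8192      # context length you plan to serve
BYTES_FP16 = 2

weights_gb = N_PARAMS * BYTES_FP16 / 1e9
# KV cache: keys + values, per layer, per KV head, per position
kv_cache_gb = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * SEQ_LEN * BYTES_FP16 / 1e9

print(f"weights (fp16):  ~{weights_gb:.0f} GB  -> stays on the Xeon box")
print(f"KV cache (fp16): ~{kv_cache_gb:.1f} GB -> fits on the local machine")
```

With these assumptions the weights come to roughly 52 GB in fp16 while the KV cache is about 3 GB, which is why keeping only the attention state local makes the hardware requirement modest.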
Why It Matters
Makes 26B+ local LLMs feasible without expensive GPUs, democratizing private AI inference.