To 16GB VRAM users: plug in your old GPU
Combine a 16GB and 6GB GPU for 22GB VRAM and 19 t/s inference.
A Reddit user, akira3weet, shared a practical hack for running dense ~30B-parameter LLMs on modest hardware: pair a modern 16GB GPU (e.g., an RTX 5070 Ti) with an older 6GB+ card (e.g., an RTX 2060) under llama-server. By dropping the older card into a secondary PCIe x4 slot, they reached a combined 22GB of VRAM, close to 24GB cards like the RTX 4090. The key configuration passes --device Vulkan1,Vulkan2 to use both GPUs, --no-mmap (with --mlock left off) so the weights are not kept resident in system RAM, and --cache-type-k q8_0 / --cache-type-v q8_0 to shrink the KV cache's VRAM footprint. With Qwen3.6-27B at Q4_K_M quantization and a 128K context window, they measured 186.76 tokens/s prompt processing and 19.21 tokens/s generation, roughly 5x faster than the ~4 t/s they saw from a single card. The run relies on llama.cpp's default --split-mode layer, which spreads layers unevenly across the two cards. The approach suits enthusiasts with an older card such as a GTX 1060 or RTX 2060 lying around, though results depend on PCIe bandwidth and card compatibility.
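A minimal sketch of what such an invocation could look like, not the poster's exact command: the model filename, -ngl value, and 131072-token context are assumptions, and the Vulkan device names depend on how llama.cpp enumerates the GPUs on a given machine.

```sh
# Sketch of the configuration described above (assumed filename and values).
# --no-mmap loads the weights directly instead of memory-mapping the file,
# --cache-type-k/v q8_0 store the KV cache at 8-bit precision to save VRAM,
# and --split-mode layer (the default) spreads layers across both cards.
# Note: on some llama.cpp builds, a quantized V cache also requires flash
# attention to be enabled (--flash-attn).
llama-server \
  -m ./model-Q4_K_M.gguf \
  --device Vulkan1,Vulkan2 \
  --split-mode layer \
  -ngl 99 \
  --no-mmap \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 131072
```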
- Pair a 16GB RTX 5070 Ti with a 6GB RTX 2060 for 22GB total VRAM, enabling 30B models.
- Use llama-server with --device Vulkan1,Vulkan2 for multi-GPU; --no-mmap keeps the weights out of system RAM (see the device/split sketch after this list).
- Achieved 186 t/s prompt eval and 19 t/s generation at 71K context, 5x faster than a single card.
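Two reproduction notes that are not from the post itself: recent llama.cpp builds can print how they enumerate GPUs, and the layer split can be biased by hand if the automatic one over- or under-fills the smaller card. The 16,6 ratio below is only an illustration based on the two cards' VRAM sizes.

```sh
# Device names vary per machine, so check how llama.cpp enumerates the GPUs:
llama-server --list-devices

# If the automatic layer split doesn't suit the two cards, --tensor-split
# biases it explicitly (here roughly 16:6, matching the VRAM sizes):
llama-server -m ./model-Q4_K_M.gguf --device Vulkan1,Vulkan2 -ngl 99 --no-mmap --tensor-split 16,6
```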
Why It Matters
Repurposing an old GPU offers a cost-effective path to running large LLMs locally without an expensive upgrade.