Open Source

ik_llama.cpp boosts Qwen3.6-35B-A3B inference 23% on 12GB RTX 4070 Super

From about 90 to 110 tok/s on a 12GB GPU with a 35B model? Yes.

Deep Dive

A fork of llama.cpp called ik_llama.cpp delivered a 23% speedup for large language model inference on a consumer GPU. The benchmark, run on an RTX 4070 Super 12GB with an AMD Ryzen 7 9700X and 48GB RAM, compared ik_llama.cpp against the official llama.cpp build using the Qwen3.6-35B-A3B model quantized to IQ4_XS (4.19 bits per weight, bpw). The official llama.cpp had initially shown promising MTP (multi-token prediction) performance, but recent merges cut its throughput. ik_llama.cpp, which is optimized specifically for CPU offloading, restored those gains and then exceeded them.
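To put the quoted 4.19 bpw in context, here is a minimal sketch that converts a parameter count and a bits-per-weight figure into an approximate weight size. The function name is hypothetical. The estimate covers weights only and ignores the KV cache, activations, and any tensors stored at a different precision, so the real file size may differ.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (1 GB = 1e9 bytes).

    Weights-only estimate: ignores KV cache, activations, and any tensors
    kept at a different precision than the headline bits-per-weight figure.
    """
    total_bits = n_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


# Figures from the article: 35B total parameters at 4.19 bpw (IQ4_XS).
size_gb = quantized_weight_gb(35e9, 4.19)
vram_gb = 12.0  # RTX 4070 Super

print(f"Estimated weights: {size_gb:.1f} GB")
print(f"Exceeds VRAM by:   {max(size_gb - vram_gb, 0.0):.1f} GB")
```

Under these assumptions the weights alone come to more than 12GB, which suggests part of the model runs from system RAM. The split between VRAM and RAM depends on offload settings the article does not give.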

On a suite of nine tasks (code, summarization, math, etc.), ik_llama.cpp averaged 110.24 tokens per second versus 89.76 tok/s on standard llama.cpp, a 23% increase. Peak speeds hit 122.3 tok/s on summarization and 110.3 tok/s for C++ code generation. The model is a mixture-of-experts design with 35B total parameters, of which only about 3B are active per token (the "A3B" in its name). At 4.19 bpw the weights come to roughly 18GB, more than the card's 12GB of VRAM, so part of the model has to run from system RAM, which is where ik_llama.cpp's CPU-offloading optimizations pay off. The IQ4_XS quant is reported to match the accuracy of the larger Q4_K_XL quant while saving 4GB. These results, obtained at temperature 0.0 for reproducibility, show that with the right optimizations a mid-range GPU can run a 30B+ model at interactive speeds.
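The headline percentage follows directly from the two averages. A small sketch of the arithmetic, using only the throughput figures reported above (the function name is illustrative):

```python
def relative_speedup(baseline_tps: float, new_tps: float) -> float:
    """Fractional throughput gain of new_tps over baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (new_tps - baseline_tps) / baseline_tps


baseline_tps = 89.76   # official llama.cpp, average over nine tasks
ik_tps = 110.24        # ik_llama.cpp, same tasks

gain = relative_speedup(baseline_tps, ik_tps)
print(f"Speedup: {gain:.1%}")  # Speedup: 22.8%

# The same gain expressed as wall-clock time for a 1,000-token response.
tokens = 1000
print(f"llama.cpp:    {tokens / baseline_tps:.1f} s")
print(f"ik_llama.cpp: {tokens / ik_tps:.1f} s")
```

The exact figure is 22.8%, which the article rounds to 23%.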

Key Points
  • ik_llama.cpp averaged 110.24 tok/s vs 89.76 tok/s on official llama.cpp, a 23% improvement on an RTX 4070 Super 12GB.
  • The Qwen3.6-35B-A3B model used the IQ4_XS quant (4.19 bpw), which is reported to match Q4_K_XL accuracy while being 4GB smaller.
  • MTP (multi-token prediction) accept rates exceeded 87% across all tasks, with peak throughput of 122.3 tok/s on summarization.
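
To see why a high accept rate matters for throughput, here is a rough sketch of the usual speculative-decoding estimate of tokens produced per verification pass. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The draft length is a hypothetical parameter, since the article does not state one, and the benchmark's accept-rate metric may be defined differently.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Simplified model (an assumption, not the benchmark's method): each of
    the draft_len drafted tokens is accepted independently with probability
    accept_rate, and the first rejection discards the rest of the draft.
    The verifying pass always yields one token itself, so the expectation
    is 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


# 0.87 is the lower bound reported in the article; draft lengths are illustrative.
for k in (1, 2, 3):
    print(f"draft_len={k}: {expected_tokens_per_step(0.87, k):.2f} tokens/step")
```

Tokens per pass is an upper bound on the speedup, not the speedup itself, because drafting and verification add their own cost.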

Why It Matters

A 23% throughput gain pushes a 35B-parameter model past 100 tok/s on a 12GB consumer GPU, fast enough to make large local models practical for interactive use.