ik_llama.cpp gives 26x faster prompt processing on Qwen 3.5 27B — real-world numbers
A community fork of llama.cpp achieves 1,122 tokens/sec by fixing a critical architectural bottleneck.
A community-maintained fork of the popular llama.cpp inference engine is delivering massive performance gains for users of Qwen's latest 27-billion-parameter model. The 'ik_llama.cpp' fork, developed by Thireus, addresses a critical architectural mismatch: Qwen 3.5 uses a hybrid Gated Delta Network (GDN) and Mamba-style state space model (SSM) architecture, but the mainline llama.cpp implementation split this computation inefficiently across 34 separate graph nodes, forcing repeated round-trips to the CPU. The fork implements fused GDN kernels that keep the entire operation on the GPU's CUDA cores, slashing graph splits from 34 to just 2.
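To make the architecture concrete, here is a toy sketch of the kind of gated delta-rule recurrence that GDN-style layers compute. The update form, gate values, and 2x2 dimensions are illustrative assumptions, not Qwen's exact equations; the point is that each step mixes a decay, a rank-1 correction, and element-wise math that a fused kernel can do in one GPU launch instead of many small graph nodes.

```python
# Toy sketch (NOT the actual kernel): a gated delta-rule recurrence.
# All shapes and gate values are illustrative assumptions.

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step: decay the state S, then nudge it toward
    associating key k with value v (the 'delta rule')."""
    d = len(k)
    # Global gate/decay: S <- alpha * S
    S = [[alpha * S[i][j] for j in range(d)] for i in range(d)]
    # Prediction error: v - S^T k (what the state currently recalls for k)
    pred = [sum(S[i][j] * k[i] for i in range(d)) for j in range(d)]
    err = [v[j] - pred[j] for j in range(d)]
    # Rank-1 correction: S += beta * k err^T
    for i in range(d):
        for j in range(d):
            S[i][j] += beta * k[i] * err[j]
    return S

# Run a few steps with toy keys/values.
S = [[0.0, 0.0], [0.0, 0.0]]
for k, v in [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [0.2, 0.8])]:
    S = gated_delta_step(S, k, v, alpha=0.9, beta=1.0)
print(S)
```

In a naive graph implementation, each decay, matmul, and rank-1 update above becomes its own node; fusing them collapses the whole step into a single kernel launch.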
Benchmark results on an NVIDIA RTX PRO 4000 GPU show the staggering impact. Prompt evaluation speed jumped from ~43 tokens/second to 1,122 tokens/second, a 26x improvement. Generation speed also saw a 3.5x boost, from ~7.5 to 26 tokens/second. This transforms workflows, especially agentic coding, where prompts change frequently. While a known 'full re-processing bug' in Qwen's recurrent architecture still forces the entire prompt to be re-evaluated on each turn, the new speed makes this tolerable, turning minute-long waits into seconds.
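The arithmetic behind "minute-long waits into seconds" is easy to check from the figures above; the 8,000-token prompt length used here is an illustrative assumption, not from the benchmark:

```python
# Sanity-check the reported speedups (tok/s figures from the benchmark above).
PP_OLD, PP_NEW = 43.0, 1122.0   # prompt processing, tokens/second
TG_OLD, TG_NEW = 7.5, 26.0      # token generation, tokens/second

print(f"prompt speedup: {PP_NEW / PP_OLD:.1f}x")   # ~26.1x
print(f"gen speedup:    {TG_NEW / TG_OLD:.1f}x")   # ~3.5x

# The full re-processing bug re-evaluates the whole prompt each turn,
# so the per-turn wait scales with context length (8,000 tokens assumed):
prompt_tokens = 8000
print(f"before: {prompt_tokens / PP_OLD:.0f} s per turn")  # ~186 s
print(f"after:  {prompt_tokens / PP_NEW:.0f} s per turn")  # ~7 s
```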
The fork is available as pre-built Windows binaries with CUDA 12.8 support and functions as a drop-in replacement, using the same command-line arguments and providing the same OpenAI-compatible API on port 1234. This represents a significant stopgap optimization for the open-source ecosystem, demonstrating how targeted kernel-level engineering can unlock the latent performance of novel model architectures before those fixes are merged into mainline projects.
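Because the fork exposes the same OpenAI-compatible API on port 1234, existing clients need no changes. A minimal request sketch (the model id and prompt are illustrative assumptions; the endpoint path follows the standard OpenAI chat-completions convention):

```python
# Build a request for the fork's OpenAI-compatible endpoint on port 1234.
# Model id and prompt are illustrative assumptions.
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen3.5-27b",  # assumed model id; use whatever your server reports
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    BASE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the server running, send it with:
#   resp = urllib.request.urlopen(req)
#   print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.full_url)
```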
- Achieves 26x faster prompt processing (1,122 tok/sec vs 43 tok/sec) for Qwen 3.5 27B by using fused GPU kernels.
- Reduces computational graph splits from 34 to 2 by fixing the mismatch with Qwen's Gated Delta Network architecture.
- Available as a drop-in replacement for mainline llama.cpp, offering the same API for immediate performance gains.
Why It Matters
This unlocks practical, high-speed local inference for advanced 27B parameter models, making agentic AI workflows viable on consumer-grade hardware.