vLLM Just Merged TurboQuant Fix for Qwen 3.5+
The 'Not Implemented' error on Mamba layers is now history, enabling 4-bit KV cache quantization.
The open-source inference engine vLLM has merged a critical fix for its TurboQuant quantization feature, resolving a 'Not Implemented' error that occurred when running Qwen 3.5+ models. The error was specific to Mamba layers, the state-space model (SSM) components used in recent Qwen releases. The patch, linked to pull request #39931, was contributed by a community member and has been validated with Qwen 3.6 at 27B parameters.
Users can now enable quantization via the --kv-cache-dtype argument, choosing among four options:
- turboquant_4bit_nc: 4-bit keys and values (non-composite)
- turboquant_k8v4: 8-bit keys, 4-bit values
- turboquant_k3v4_nc: 3-bit keys, 4-bit values (non-composite)
- turboquant_3bit_nc: 3-bit keys and values (non-composite)

In chunked prefill mode, a Mamba alignment error can still surface; the workaround is to raise --max-num-batched-tokens to 4096. Together, these changes unlock significant KV cache memory savings for deploying large Qwen models in production, making 27B+ parameter inference feasible on consumer hardware.
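As a minimal sketch of a launch command, assuming the server is started with vLLM's standard vllm serve entry point — the model ID below is a placeholder, and the dtype string is one of the values listed above:

```bash
# Sketch: serve a Qwen model with a fully 4-bit TurboQuant KV cache.
# "Qwen/Qwen3.6-27B" is a placeholder model ID, not a confirmed release name.
vllm serve Qwen/Qwen3.6-27B \
  --kv-cache-dtype turboquant_4bit_nc
```

Going by the naming scheme alone, swapping in turboquant_k8v4 keeps keys at 8 bits while still halving value storage, a middle ground worth trying if the fully 4-bit cache degrades output quality.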
- Fix resolves 'Not Implemented' error on Mamba layers for Qwen 3.5+ models via vLLM PR #39931
- TurboQuant supports multiple KV cache dtypes: 4-bit, 3-bit, and hybrid (turboquant_k8v4, etc.)
- Chunked prefill requires --max-num-batched-tokens 4096 to avoid Mamba alignment errors (sketched below)
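For the chunked prefill case in the last bullet, a hedged sketch of the workaround: --enable-chunked-prefill is vLLM's standard flag for that mode, the 4096 ceiling is the value from the fix notes, and the model ID is again a placeholder.

```bash
# Sketch: chunked prefill plus the Mamba-alignment workaround.
# Raising --max-num-batched-tokens to 4096 avoids the reported
# alignment error on Mamba layers; the model ID is a placeholder.
vllm serve Qwen/Qwen3.6-27B \
  --kv-cache-dtype turboquant_k8v4 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096
```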
Why It Matters
Enables efficient 4-bit inference for large Qwen models, reducing memory costs for production deployments.