Can't replicate Reddit numbers with Qwen 27B on a 3090TI.
Qwen 3.6 27B's hybrid SSM architecture exposes CPU limits on older hardware
A developer testing Alibaba's Qwen 3.6 27B model on a GeForce RTX 3090 Ti paired with an older i9-9900K CPU (released 2018) documented a stark performance discrepancy: while other users reported 30-100+ tokens per second (tok/s) using speculative decoding or other optimized setups, their own results plateaued at just 18-19 tok/s with a 50k-token context window.
The root cause lies in Qwen 3.6's hybrid architecture, which combines Transformer and State Space Model (SSM) components. During generation, the SSM recurrence state update is computed on the CPU (in a 552 MiB host buffer) and then synchronized back to the GPU. This step is inherently sequential: each token's state depends on the previous token's state, so it cannot be offloaded to the GPU or parallelized across tokens. The i9-9900K predates both AVX-512 VNNI (introduced with Ice Lake) and AVX-VNNI (introduced with Alder Lake), so inference falls back to slower AVX2/FMA code paths. Even with the weights fully resident in VRAM (a 2020 MiB CUDA buffer), the CPU becomes the bottleneck, capping generation at roughly 19 tok/s. A newer CPU with VNNI support could push throughput higher, but the architecture fundamentally ties generation speed to CPU capability.
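The sequential nature of the bottleneck can be illustrated with a toy SSM recurrence. This is a minimal sketch, not Qwen's actual kernel: the dimensions, matrices, and `ssm_step` function are all hypothetical, chosen only to show why each generated token must wait for the previous token's state update.

```python
import numpy as np

# Toy SSM recurrence (illustrative only -- not Qwen's real implementation).
# Each step computes h' = A @ h + B @ x, then y = C @ h'. Because step t
# reads the state produced by step t-1, the generation loop is strictly
# sequential and cannot be parallelized across tokens the way attention
# prefill can.

d_state = 64    # hypothetical recurrent state size
d_model = 128   # hypothetical model width

rng = np.random.default_rng(0)
A = rng.standard_normal((d_state, d_state)) * 0.01  # state transition
B = rng.standard_normal((d_state, d_model)) * 0.01  # input projection
C = rng.standard_normal((d_model, d_state)) * 0.01  # output projection

def ssm_step(h, x):
    """One recurrence step: update the state, then project to an output."""
    h = A @ h + B @ x
    return h, C @ h

h = np.zeros(d_state)
tokens = rng.standard_normal((8, d_model))  # stand-in for 8 token embeddings
outputs = []
for x in tokens:  # data dependency on h forces one-token-at-a-time execution
    h, y = ssm_step(h, x)
    outputs.append(y)

print(len(outputs), outputs[0].shape)  # 8 (128,)
```

If this loop runs on the CPU, as in the setup described above, per-step throughput is bounded by how fast the CPU can execute the matrix-vector updates, which is exactly where VNNI-class instructions would help.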
- Qwen 3.6 27B's hybrid SSM architecture requires a CPU-side state update for every generated token (through a ~552 MiB host buffer), limiting generation speed on older CPUs like the i9-9900K.
- Performance caps at ~18-19 tok/s on a 3090 Ti with the 2018 i9-9900K, while users on newer setups with AVX-VNNI or AVX-512 VNNI support (Ice Lake and later) report 30-100+ tok/s.
- Weights are fully GPU-resident (2020 MiB CUDA buffer), but the sequential CPU SSM updates still bottleneck generation, making this an architectural constraint rather than a VRAM one.
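Whether a given CPU hits the fast or slow path can be checked from its advertised feature flags. A small Linux-only sketch (it reads `/proc/cpuinfo`; the flag names follow the kernel's x86 naming, where `avx512_vnni` corresponds to Ice Lake-era AVX-512 VNNI and `avx_vnni` to Alder Lake's 256-bit variant):

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of x86 feature flags reported by the Linux kernel.

    Returns an empty set on non-Linux systems or if the file is unreadable.
    """
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
# avx2/fma are the fallback paths the i9-9900K is limited to;
# avx_vnni / avx512_vnni mark the faster integer-dot-product paths.
for flag in ("avx2", "fma", "avx_vnni", "avx512_vnni"):
    print(f"{flag}: {'yes' if flag in flags else 'no'}")
```

On the i9-9900K described above, this would show `avx2` and `fma` present but both VNNI flags absent.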
Why It Matters
Highlights how hybrid Transformer-SSM architectures can bottleneck on older hardware: per-token CPU-GPU synchronization ties generation speed to CPU instruction-set support, not just GPU horsepower.