Qwen3-30B-A3B vs Qwen3.5-35B-A3B on RTX 5090
The new Mixture-of-Experts model handles 32K context with no degradation but generates text about 35% slower than its predecessor on an RTX 5090.
Alibaba's Qwen team has launched Qwen3.5-35B-A3B, an upgraded Mixture-of-Experts (MoE) model built on the architecture of the popular Qwen3-30B-A3B. The new model adds 5 billion total parameters (keeping 3 billion active), ships with a vision projector, and grows the vocabulary from 152K to 248K tokens. Initial benchmarking on NVIDIA's flagship RTX 5090 GPU reveals a trade-off: the 3.5 variant generates text roughly 35% slower than its predecessor in raw throughput tests, but it holds tokens-per-second flat across its full 32K context window, where the 30B model degrades by 21%.
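That flat scaling is easy to probe locally. Below is a minimal sketch using the llama-cpp-python bindings, not the benchmark's actual harness: it times decode-only throughput at progressively deeper context fills. The GGUF filename is a hypothetical placeholder, and the word-based padding only approximates token counts.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Hypothetical local filename; point this at your own Q4_K_M GGUF.
MODEL_PATH = "Qwen3.5-35B-A3B-Q4_K_M.gguf"

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=32768, verbose=False)

def decode_tps(prompt: str, n_gen: int = 128) -> float:
    """Decode-only tokens/sec: the clock starts on the first streamed
    token, so prompt processing (prefill) is excluded from the timing."""
    t0, count = None, 0
    for _ in llm(prompt, max_tokens=n_gen, temperature=0.0, stream=True):
        if t0 is None:
            t0 = time.perf_counter()  # first token out; prefill finished
        else:
            count += 1
    return count / (time.perf_counter() - t0)

# Fill the window progressively; flat numbers mean no context degradation.
filler = "The quick brown fox jumps over the lazy dog. "
for approx_tokens in (2_000, 8_000, 16_000, 30_000):
    prompt = filler * (approx_tokens // 10)  # ~10 tokens per repetition
    print(f"~{approx_tokens:>6}-token context: {decode_tps(prompt):6.1f} tok/s")
```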
Technical analysis using llama.cpp with Q4_K_M quantization shows Qwen3.5-35B-A3B holding 29GB of VRAM while loaded and idle, and producing slightly more verbose, structured outputs on creative and coding tasks. The regression is most pronounced in long-form generation (116 tokens/sec vs 232.6 for 800-token outputs, roughly half the speed), while prompt processing sees only a modest slowdown. For developers choosing between the models, the decision comes down to whether consistent long-context performance or raw generation speed matters more; both remain viable for local deployment on high-end consumer hardware like the RTX 5090.
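For the head-to-head generation numbers, a quick A/B loop is enough in principle. The sketch below, again with llama-cpp-python and hypothetical GGUF filenames, times an 800-token completion from each model back to back; the short prompt keeps prefill's contribution to the timing negligible.

```python
import time
from llama_cpp import Llama

# Hypothetical filenames; substitute the Q4_K_M GGUFs you actually have.
MODELS = {
    "Qwen3-30B-A3B":   "Qwen3-30B-A3B-Q4_K_M.gguf",
    "Qwen3.5-35B-A3B": "Qwen3.5-35B-A3B-Q4_K_M.gguf",
}
PROMPT = "Write a detailed technical explainer on how MoE routing works."

for name, path in MODELS.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    t0 = time.perf_counter()
    out = llm(PROMPT, max_tokens=800, temperature=0.7)
    dt = time.perf_counter() - t0
    n_tok = out["usage"]["completion_tokens"]
    print(f"{name}: {n_tok} tokens in {dt:.1f}s -> {n_tok / dt:.1f} tok/s")
    del llm  # release ~29GB of VRAM before loading the next model
```

The sequential load-and-release pattern matters here: two 29GB-class models cannot coexist in the RTX 5090's 32GB of VRAM.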
- Qwen3.5-35B-A3B shows 35% slower generation (153.8 vs 237.1 tokens/sec) but zero context degradation across 32K tokens
- Model uses 29GB VRAM with Q4_K_M quantization and features a 248K-token vocabulary, up from 152K in the previous version
- Slightly improved output quality with more structured responses, while retaining the same 3B-active-parameter MoE architecture
Why It Matters
Developers must choose between faster generation and better long-context handling when deploying local AI on consumer hardware.