Qwen3-30B-A3B vs Qwen3.5-35B-A3B on RTX 5090
The new Mixture-of-Experts model handles 32K context with no degradation but generates text about 35% slower than its predecessor on an RTX 5090.
Alibaba's Qwen team has launched Qwen3.5-35B-A3B, an upgraded Mixture-of-Experts (MoE) model built on the architecture of the popular Qwen3-30B-A3B. The new model adds 5 billion total parameters (keeping 3 billion active), ships with a vision projector, and grows the vocabulary from 152K to 248K tokens. Initial benchmarking on NVIDIA's flagship RTX 5090 GPU reveals a trade-off: the 3.5 variant generates text roughly 35% slower than its predecessor in raw throughput tests, but it holds tokens-per-second flat across its full 32K context window, where the 30B model degrades by 21%.
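That flat scaling is easy to probe locally. Below is a minimal sketch using the llama-cpp-python bindings, not the benchmark's actual harness: it times decode-only throughput at progressively deeper context fills. The GGUF filename is a hypothetical placeholder, and the word-based padding only approximates token counts.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Hypothetical local filename; point this at your own Q4_K_M GGUF.
MODEL_PATH = "Qwen3.5-35B-A3B-Q4_K_M.gguf"

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=32768, verbose=False)

def decode_tps(prompt: str, n_gen: int = 128) -> float:
    """Decode-only tokens/sec: the clock starts on the first streamed
    token, so prompt processing (prefill) is excluded from the timing."""
    t0, count = None, 0
    for _ in llm(prompt, max_tokens=n_gen, temperature=0.0, stream=True):
        if t0 is None:
            t0 = time.perf_counter()  # first token out; prefill finished
        else:
            count += 1
    return count / (time.perf_counter() - t0)

# Fill the window progressively; flat numbers mean no context degradation.
filler = "The quick brown fox jumps over the lazy dog. "
for approx_tokens in (2_000, 8_000, 16_000, 30_000):
    prompt = filler * (approx_tokens // 10)  # ~10 tokens per repetition
    print(f"~{approx_tokens:>6}-token context: {decode_tps(prompt):6.1f} tok/s")
```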
Technical analysis using llama.cpp with Q4_K_M quantization shows Qwen3.5-35B-A3B holding 29GB of VRAM while loaded and idle, and producing slightly more verbose, structured outputs on creative and coding tasks. The regression is most pronounced in long-form generation (116 tokens/sec vs 232.6 for 800-token outputs, roughly half the speed), while prompt processing sees only a modest slowdown. For developers choosing between the models, the decision comes down to whether consistent long-context performance or raw generation speed matters more; both remain viable for local deployment on high-end consumer hardware like the RTX 5090.
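For the head-to-head generation numbers, a quick A/B loop is enough in principle. The sketch below, again with llama-cpp-python and hypothetical GGUF filenames, times an 800-token completion from each model back to back; the short prompt keeps prefill's contribution to the timing negligible.

```python
import time
from llama_cpp import Llama

# Hypothetical filenames; substitute the Q4_K_M GGUFs you actually have.
MODELS = {
    "Qwen3-30B-A3B":   "Qwen3-30B-A3B-Q4_K_M.gguf",
    "Qwen3.5-35B-A3B": "Qwen3.5-35B-A3B-Q4_K_M.gguf",
}
PROMPT = "Write a detailed technical explainer on how MoE routing works."

for name, path in MODELS.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    t0 = time.perf_counter()
    out = llm(PROMPT, max_tokens=800, temperature=0.7)
    dt = time.perf_counter() - t0
    n_tok = out["usage"]["completion_tokens"]
    print(f"{name}: {n_tok} tokens in {dt:.1f}s -> {n_tok / dt:.1f} tok/s")
    del llm  # release ~29GB of VRAM before loading the next model
```

The sequential load-and-release pattern matters here: two 29GB-class models cannot coexist in the RTX 5090's 32GB of VRAM.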
- Qwen3.5-35B-A3B shows 35% slower generation (153.8 vs 237.1 tokens/sec) but zero context degradation across 32K tokens
- Model uses 29GB VRAM with Q4_K_M quantization and features a 248K-token vocabulary, up from 152K in the previous version
- Slightly improved output quality with more structured responses, while retaining the same 3B-active-parameter MoE architecture
Why It Matters
Developers must choose between faster generation and better long-context handling when deploying local AI on consumer hardware.