Qwen 3.5 35B MoE - 100k Context, 40+ TPS on RTX 5060 Ti (16GB)
The new MoE model achieves 41 tokens/sec generation on a $500 GPU with a 100,000-token context window.
Alibaba's Qwen research team has released performance benchmarks for its Qwen 3.5 35B MoE (Mixture of Experts) model running on consumer-grade hardware, demonstrating that large-context AI inference is becoming increasingly accessible. The model achieved 41.35 tokens per second (TPS) during generation with a 100,000-token context window on an NVIDIA GeForce RTX 5060 Ti with just 16GB of VRAM, running on the llama.cpp server with Vulkan and CUDA backends. This is a notable milestone in making powerful AI models practical for local deployment without expensive enterprise-grade hardware.
The setup used llama-server.exe with flash attention enabled, 40 layers offloaded to the GPU, and continuous batching. In the "Treasure Island" benchmark with a 99,961-token prompt, the system processed the prompt at 1,154 TPS and sustained 35.14 TPS during generation. The MoE architecture routes each token through a small subset of specialized sub-networks (experts), so only a fraction of the 35B parameters is active per token; this lets the model approach the quality of larger dense models while remaining cheaper to run. The result suggests that high-performance AI with long-context capabilities is becoming feasible for developers and researchers on modest hardware budgets.
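For readers who want to try a similar configuration, the sketch below launches llama-server with the settings described above from Python. The model filename and port are assumptions, and the exact flag spellings follow common llama.cpp conventions; check your build's --help output, since flag names change between versions.

```python
import subprocess

# Minimal sketch: start llama-server with the settings described in the article.
# The GGUF filename and port are hypothetical; -fa / -cb / -ngl follow common
# llama.cpp flag conventions and may differ in your build.
server = subprocess.Popen([
    "llama-server",                   # llama-server.exe on Windows
    "-m", "qwen3.5-35b-moe-q4.gguf",  # hypothetical quantized model file
    "-c", "100000",                   # 100k-token context window
    "-ngl", "40",                     # offload 40 layers to the GPU
    "-fa",                            # enable flash attention
    "-cb",                            # enable continuous batching
    "--port", "8080",
])
print(f"llama-server started with PID {server.pid}")
```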
- Achieves 41.35 TPS generation speed with 100k context window on RTX 5060 Ti (16GB)
- Uses Mixture of Experts (MoE) architecture for efficient 35B parameter model performance
- Demonstrates 1,154 TPS prompt processing in llama.cpp with flash attention and continuous batching (a reproduction sketch follows this list)
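To check throughput on your own hardware, the hedged sketch below sends a request to the server started above and prints the server-side timing statistics. The endpoint path and field names such as predicted_per_second reflect current llama.cpp behavior and may vary between versions.

```python
import requests

# Minimal sketch: query the local llama-server and print the throughput it
# reports for prompt processing and generation.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Summarize the plot of Treasure Island.",
        "n_predict": 256,  # number of tokens to generate
    },
    timeout=600,
)
resp.raise_for_status()
timings = resp.json().get("timings", {})
print("prompt processing TPS:", timings.get("prompt_per_second"))
print("generation TPS:       ", timings.get("predicted_per_second"))
```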
Why It Matters
Makes large-context AI models practical for local deployment on affordable consumer hardware, lowering barriers for developers.