Qwen 3.5 35B MoE - 100k Context, 40+ TPS on RTX 5060 Ti (16GB)
The new MoE model achieves 41 tokens/sec generation on a $500 GPU with a 100,000-token context window.
Alibaba's Qwen research team has released performance benchmarks for its Qwen 3.5 35B MoE (Mixture of Experts) model running on consumer-grade hardware, demonstrating that large-context AI inference is becoming increasingly accessible. The model achieved 41.35 tokens per second (TPS) during generation with a 100,000-token context window on an NVIDIA GeForce RTX 5060 Ti with just 16GB of VRAM, running on the llama.cpp server with Vulkan and CUDA backends. This is a notable milestone in making powerful AI models practical for local deployment without expensive enterprise-grade hardware.
The setup used llama-server.exe with flash attention enabled, 40 layers offloaded to the GPU, and continuous batching. In the "Treasure Island" benchmark with a 99,961-token prompt, the system processed the prompt at 1,154 TPS and sustained 35.14 TPS during generation. The MoE architecture routes each token through a small subset of specialized sub-networks (experts), so only a fraction of the 35B parameters is active per token; this lets the model approach the quality of larger dense models while remaining cheaper to run. The result suggests that high-performance AI with long-context capabilities is becoming feasible for developers and researchers on modest hardware budgets.
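For readers who want to try a similar configuration, the sketch below launches llama-server with the settings described above from Python. The model filename and port are assumptions, and the exact flag spellings follow common llama.cpp conventions; check your build's --help output, since flag names change between versions.

```python
import subprocess

# Minimal sketch: start llama-server with the settings described in the article.
# The GGUF filename and port are hypothetical; -fa / -cb / -ngl follow common
# llama.cpp flag conventions and may differ in your build.
server = subprocess.Popen([
    "llama-server",                   # llama-server.exe on Windows
    "-m", "qwen3.5-35b-moe-q4.gguf",  # hypothetical quantized model file
    "-c", "100000",                   # 100k-token context window
    "-ngl", "40",                     # offload 40 layers to the GPU
    "-fa",                            # enable flash attention
    "-cb",                            # enable continuous batching
    "--port", "8080",
])
print(f"llama-server started with PID {server.pid}")
```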
- Achieves 41.35 TPS generation speed with 100k context window on RTX 5060 Ti (16GB)
- Uses Mixture of Experts (MoE) architecture for efficient 35B parameter model performance
- Demonstrates 1,154 TPS prompt processing in llama.cpp with flash attention and continuous batching (a reproduction sketch follows this list)
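To check throughput on your own hardware, the hedged sketch below sends a request to the server started above and prints the server-side timing statistics. The endpoint path and field names such as predicted_per_second reflect current llama.cpp behavior and may vary between versions.

```python
import requests

# Minimal sketch: query the local llama-server and print the throughput it
# reports for prompt processing and generation.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Summarize the plot of Treasure Island.",
        "n_predict": 256,  # number of tokens to generate
    },
    timeout=600,
)
resp.raise_for_status()
timings = resp.json().get("timings", {})
print("prompt processing TPS:", timings.get("prompt_per_second"))
print("generation TPS:       ", timings.get("predicted_per_second"))
```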
Why It Matters
Makes large-context AI models practical for local deployment on affordable consumer hardware, lowering barriers for developers.