Qwen 3.6 35B A3B Q4_K_M quant evaluation
A quantized MoE model with only 3B of its 35B parameters active per token achieves solid benchmarks without needing a GPU.
Alibaba's Qwen 3.6 35B A3B model, a Mixture of Experts (MoE) architecture with 35 billion total parameters but only 3 billion active per token, has demonstrated impressive efficiency in a new evaluation. The model was tested in its Q4_K_M quantized GGUF format, a roughly 4-bit compression that sharply reduces its memory footprint, running entirely on a CPU system with 32 virtual cores and 125GB of RAM (no GPU required). On standard benchmarks it achieved 74.30% on HellaSwag, a test of commonsense reasoning, while generating output at 22 tokens per second.
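For readers who want to reproduce the CPU-only setup, here is a minimal sketch using llama-cpp-python, a common runtime for GGUF files. The filename, context size, and prompt are illustrative assumptions, not the evaluation's actual configuration; only the 32-thread count mirrors the test system described above.

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-35b-a3b-q4_k_m.gguf",  # hypothetical local filename
    n_ctx=4096,    # context window; the evaluation's setting is not stated
    n_threads=32,  # match the test system's 32 vCPUs
)

t0 = time.time()
# Recent llama-cpp-python versions apply the chat template embedded in the
# GGUF metadata when one is present.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize MoE routing briefly."}],
    max_tokens=256,
)
dt = time.time() - t0

# llama-cpp-python reports token counts in an OpenAI-style usage field.
tps = out["usage"]["completion_tokens"] / dt
print(f"{tps:.1f} tokens/s")  # the article reports ~22 tok/s on this setup
```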
This performance is notable for a quantized model running on commodity CPU hardware, making advanced AI more accessible. The evaluation also covered code generation and function calling, where the model scored 47.56% on HumanEval and 46.00% on the BFCL benchmark, indicating these are more challenging tasks for this configuration. The entire testing pipeline was built and executed with the Neo AI Engineer tool, which automated selecting the optimal quantized model version, applying the correct chat template, and running a consolidated evaluation harness across 1,264 total samples.
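Applying the correct chat template matters because instruction-tuned models expect their prompts wrapped in model-specific role markers. One common way to do this is with a Hugging Face tokenizer, sketched below; the repo id is a placeholder, since the evaluation's exact model repository is not given here.

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual model repository.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "def reverse_string(s):"}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn marker
)
print(prompt)  # raw templated string, ready to feed to the GGUF runtime
```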
- The Q4_K_M quantized model scored 74.3% on HellaSwag for commonsense reasoning.
- It runs at 22 tokens/second on a CPU-only system (32 vCPUs, 125GB RAM).
- The evaluation was automated using the Neo AI Engineer tool across 1,264 samples (a harness skeleton is sketched after this list).
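The sketch below shows only the aggregation structure a consolidated harness might use; the `Task` type and `generate`/`score` helpers are hypothetical stand-ins, not Neo AI Engineer internals. Note that real harnesses typically score HellaSwag by ranking continuation log-likelihoods rather than by free-form generation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    name: str
    samples: List[Tuple[str, str]]     # (prompt, reference) pairs
    score: Callable[[str, str], bool]  # prediction vs. reference

def run_harness(generate: Callable[[str], str], tasks: List[Task]) -> dict:
    """Run every task; return per-task accuracy plus the total sample count."""
    results, total = {}, 0
    for task in tasks:
        hits = sum(task.score(generate(p), ref) for p, ref in task.samples)
        results[task.name] = hits / len(task.samples)
        total += len(task.samples)
    results["total_samples"] = total   # 1,264 in the reported run
    return results
```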
Why It Matters
It shows that efficient, capable AI models can run on standard CPU servers, lowering the barrier to deployment.