Benchmarked all Unsloth Qwen3.5-27B Q4 models on a 3090
A community benchmark finds the IQ4_NL and IQ4_XS quantized models lead the field, in perplexity and prompt-processing speed respectively.
A detailed community benchmark has put seven different 4-bit quantized versions of Unsloth's Qwen3.5-27B model head-to-head on an NVIDIA RTX 3090 GPU. The test, conducted using the llama-bench and llama-perplexity tools, measured critical performance metrics including load time, prompt evaluation speed, token generation speed, and model accuracy (perplexity) on the Wikitext dataset. The results provide a clear, data-driven comparison for developers looking to run this powerful 27-billion-parameter model locally.
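The writeup does not reproduce the exact commands used, but a typical run with llama.cpp's bundled tools looks roughly like the following sketch; the model filename, layer-offload setting, and Wikitext path are assumptions, not details from the benchmark:

```shell
# Throughput: reports load time, prompt evaluation (pp), and token
# generation (tg) speed. -ngl 99 offloads all layers to the GPU.
llama-bench -m Qwen3.5-27B-IQ4_NL.gguf -ngl 99 -p 512 -n 128

# Accuracy: computes perplexity over the raw Wikitext-2 test split.
llama-perplexity -m Qwen3.5-27B-IQ4_NL.gguf -ngl 99 \
    -f wikitext-2-raw/wiki.test.raw
```

Repeating the pair of commands across each quantized GGUF file yields the speed and PPL columns compared in the benchmark.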
The standout performers were the IQ4_NL and IQ4_XS quantized models. The IQ4_NL model delivered the best accuracy with a perplexity score of 6.9314, while the IQ4_XS variant achieved the fastest prompt processing at 1261.40 tokens per second. Notably, the IQ4_NL model outperformed the much larger UD_Q4_K_XL variant in both speed and perplexity, demonstrating that newer quantization methods can offer superior efficiency. The Q4_K_S model had the fastest load time at just over 8 seconds.
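For context on the accuracy metric: perplexity is the exponential of the average negative log-likelihood per token, so a lower score means the model assigns higher probability to the actual text. A minimal sketch of the computation, with made-up log-probabilities for illustration:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum of per-token log-probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# If every token were assigned probability 1/10, PPL would be exactly 10:
uniform = [math.log(0.1)] * 100
print(perplexity(uniform))  # ~10.0
```

By this measure, IQ4_NL's 6.9314 versus a higher score for a larger file means the smaller quant is predicting the Wikitext tokens slightly better per token, which is why the article calls it the better trade-off.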
This benchmark is crucial for the open-source AI community, as quantization—reducing model precision to save memory—is essential for local deployment on consumer hardware. The data shows that the choice of quantization algorithm (IQ4, Q4, Q4_K) significantly impacts the trade-off between speed, file size, and model fidelity. For users with an RTX 3090's 24GB of VRAM, the IQ4_NL model presents an optimal balance, offering top-tier accuracy and competitive generation speeds in a 15.7GB package.
- IQ4_NL model achieved the best accuracy with a perplexity (PPL) score of 6.9314 on the Wikitext test set.
- IQ4_XS model delivered the fastest prompt evaluation speed at 1261.40 tokens/second on the RTX 3090.
- The Q4_K_S model loaded fastest at 8024.94 ms, while the largest model, UD_Q4_K_XL (17.6GB), was the slowest at token generation.
Why It Matters
This benchmark provides essential data for developers to choose the most efficient quantized model for local AI applications, optimizing for speed or accuracy on consumer GPUs.