Has anyone else tried IQ2 quantization? I'm genuinely shocked by the quality
New UD-IQ2_XXS quantization runs Qwen3-30B at 100 TPS with near Q4_K_M quality on consumer GPUs.
A viral Reddit post has highlighted a breakthrough in making large language models (LLMs) dramatically faster and more efficient to run locally. The user tested a new ultra-low-bit quantization method called UD-IQ2_XXS on the 30-billion-parameter Qwen3-30B-A3B model. Using llama.cpp with the Vulkan backend on a consumer-grade AMD RX 9060 XT 16 GB GPU, they achieved a staggering 100 tokens per second (TPS), a 5x speedup over the previously standard Q4_K_M quantization, which ran at 20 TPS on the same hardware. The model was compressed to just 10.3 GB, allowing it to be loaded entirely into the GPU's VRAM.
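For anyone who wants to try a similar setup, here is a minimal sketch using the llama-cpp-python bindings rather than the llama.cpp CLI the poster used. The GGUF filename is hypothetical, and for Vulkan on an AMD card the bindings have to be built with Vulkan enabled (e.g. `CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python`):

```python
# Minimal sketch (not the poster's exact setup); the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-UD-IQ2_XXS.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload every layer; a 10.3 GB quant fits in 16 GB VRAM
    n_ctx=20480,      # ~20K context, as reported in the post
)

out = llm("Explain Le Chatelier's principle in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```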
The most surprising result was the minimal loss in output quality. In a head-to-head comparison with Anthropic's Claude Opus 4.6, the quantized model performed nearly identically on high-school and university-level questions in chemistry, math, and physics. Only on very specialized academic topics, such as Gödel's incompleteness theorems, did a measurable gap appear: the IQ2 quant scored 81/100 versus the Q4 quant's 92/100. Remarkably, in one instance the local 10 GB IQ2 model correctly answered a graph-analysis question that both Claude Opus 4.6 and Claude Sonnet 4.6 got wrong.
This development is significant for the open-source AI community. The IQ2 quantization types come from the llama.cpp project, and the UD prefix denotes Unsloth's dynamic variant, which keeps selected tensors at higher precision; together they represent a major leap in efficiency. They drastically lower the hardware barrier for running powerful 30B-parameter models, enabling high-speed, long-context (20K+ tokens) inference on consumer hardware. This could accelerate the adoption of local, private AI assistants and specialized models, moving more capability away from cloud APIs and onto personal devices.
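The reported file size is consistent with the IQ2 class, and the VRAM math explains why such long contexts fit. As a back-of-the-envelope check (my arithmetic, not from the post):

```python
# Why a 10.3 GB quant of a 30B-parameter model fits on a 16 GB card.
params = 30e9                       # parameter count
file_gb = 10.3                      # reported GGUF size (decimal GB assumed)
bpw = file_gb * 1e9 * 8 / params    # effective bits per weight
print(f"~{bpw:.2f} bits/weight")    # ~2.75; above the nominal 2.06 bpw of
                                    # plain IQ2_XXS, plausible if the dynamic
                                    # mix keeps some tensors at higher precision
headroom_gb = 16 - file_gb          # VRAM left for KV cache and compute buffers
print(f"~{headroom_gb:.1f} GB of VRAM headroom")  # ~5.7 GB for 20K+ context
```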
- Achieved 100 TPS (5x faster than Q4_K_M) on Qwen3-30B using UD-IQ2_XXS quantization (see the timing sketch after this list for one way to verify such a figure).
- Model size reduced to 10.3 GB, enabling full GPU offload on a 16 GB RX 9060 XT.
- Quality loss was minimal: 81/100 versus the Q4 quant's 92/100 even on the hardest academic questions.
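A throughput figure like 100 TPS is easy to sanity-check. This sketch (same hypothetical filename as before) times a single generation and divides by the completion-token count from llama-cpp-python's OpenAI-style usage field; it folds prompt processing into the total, so it slightly understates pure generation speed:

```python
import time
from llama_cpp import Llama

# Rough throughput check; the filename is hypothetical.
llm = Llama(
    model_path="Qwen3-30B-A3B-UD-IQ2_XXS.gguf",
    n_gpu_layers=-1,
    n_ctx=20480,
)

t0 = time.perf_counter()
out = llm("Write a short essay on entropy.", max_tokens=256)
elapsed = time.perf_counter() - t0

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated / elapsed:.1f} tokens/sec")
```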
Why It Matters
Makes powerful 30B-parameter models viable on consumer GPUs, enabling fast, private local AI that rivals cloud models.