Open Source

Best Qwen3.5-35B-A3B GGUF for 24GB VRAM?!

A novel GGUF quantization mix uses only legacy types for potentially faster inference on AMD and Mac hardware.

Deep Dive

A developer known as ubergarm has released a novel quantization of Alibaba's Qwen3.5-35B-A3B language model, packaged in the widely compatible GGUF format. The key innovation is the quantization strategy: rather than newer, more complex methods, the mix uses only legacy llama.cpp quantization types, specifically Q4_0, Q4_1, and Q8_0. The creator's hypothesis is that these older types benefit from more heavily optimized compute kernels, particularly on the Vulkan and ROCm backends used by AMD graphics cards. The resulting file is 19.776 GB at a density of 4.901 bits per weight (BPW), making it a candidate for systems with around 24 GB of VRAM, such as high-end consumer GPUs.
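The reported figures can be sanity-checked with simple arithmetic: bits per weight is just total file bits divided by the number of quantized weights. The sketch below uses only the numbers from the release; the weight count it derives is an implied value under a decimal-gigabyte assumption, not an official spec, and it will not exactly match the nominal parameter count since a GGUF file also stores metadata and keeps some tensors at higher precision.

```python
# Figures reported for the release.
file_size_gb = 19.776   # GGUF file size
bpw = 4.901             # bits per weight

# Implied number of quantized weights (assuming decimal gigabytes).
bits_total = file_size_gb * 1e9 * 8
n_weights = bits_total / bpw
print(f"implied weights: {n_weights / 1e9:.1f}B")

# Rough VRAM fit check: weights alone versus a 24 GB card, with the
# remainder left over for KV cache, activations, and runtime overhead.
vram_gb = 24
headroom_gb = vram_gb - file_size_gb
print(f"headroom after weights: {headroom_gb:.1f} GB")
```

The roughly 4 GB of headroom is what makes the "24 GB VRAM" framing plausible: the weights fit with room for a modest context window, though long contexts will still push the KV cache beyond it.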

The release targets two hardware ecosystems often underserved by AI optimization work: AMD GPUs and Apple Silicon Macs. For AMD users with cards like the Radeon RX 7900 XTX, the legacy quant types could translate into noticeably faster inference in applications such as koboldcpp. For Mac users, the mix raises the question of whether it can outperform Apple's native MLX framework for local deployment. Because the model is fully compatible with mainline llama.cpp and its derivatives, the community is invited to benchmark its perplexity and speed against other leading quants. The work is part of the ongoing grassroots effort to squeeze maximum performance out of large models on consumer hardware, and it gives developers and enthusiasts a practical tool to that end.
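For readers who want to run such a comparison themselves, a minimal sketch of building llama.cpp with an AMD-friendly backend and benchmarking the file follows. The GGUF filename is a placeholder, and the CMake flag names reflect recent llama.cpp; they have changed across versions, so check the repository's build docs for your tree.

```shell
# Build llama.cpp with the Vulkan backend (use -DGGML_HIP=ON for ROCm).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Token-throughput benchmark with all layers offloaded to the GPU.
# "model.gguf" is a placeholder for the downloaded quant.
./build/bin/llama-bench -m model.gguf -ngl 99

# Perplexity against a held-out text file, for quality comparisons
# between this mix and other quants of the same model.
./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw
```

Running the same two commands against a K-quant or IQ-quant of the same model is the most direct way to test the creator's kernel-speed hypothesis on a given card.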

Key Points
  • Uses only legacy Q4_0/Q4_1/Q8_0 quantization types for potentially faster kernels on Vulkan/ROCm.
  • Results in a 19.8 GB file (4.9 BPW) targeting systems with ~24GB of VRAM.
  • Aims to boost performance for AMD GPU users and could offer benefits for Apple Silicon Macs.

Why It Matters

Provides a highly optimized, efficient version of a powerful 35B-parameter model for running locally on consumer-grade AMD and Apple hardware.