NVIDIA's NVFP4 build of Alibaba's Qwen3.6-35B-A3B cuts memory about 3x with 4-bit quantization
The 4-bit quantization cuts memory by about 3.06x while retaining more than 99% of the BF16 score on the benchmarks cited
NVIDIA has released Qwen3.6-35B-A3B-NVFP4, a post-training quantized variant of Alibaba's Qwen3.6-35B-A3B model. The original 35-billion-parameter Mixture of Experts (MoE) transformer is compressed from 16-bit floating point (BF16) to the new 4-bit NVFP4 data type using NVIDIA's Model Optimizer. Only the weights and activations of the linear operators within the transformer MoE blocks are quantized; the architecture itself is unchanged, and disk size and GPU memory requirements fall by a factor of approximately 3.06. The checkpoint is ready for inference with vLLM, which allows deployment on hardware with limited memory.
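As a rough illustration of where a figure like 3.06x could come from, the sketch below works through the arithmetic. It rests on assumptions that are not in the announcement: a round 35 billion parameters, and an NVFP4 layout that costs about 4.5 bits per value (a 4-bit element plus one 8-bit scale shared by each block of 16 values). The function names are illustrative, and the implied quantized share is an estimate, not a published number.

```python
# Back-of-the-envelope memory estimate. Every constant here is an
# illustrative assumption, not a figure published with the model.
PARAMS = 35e9              # rounded parameter count from the article
BF16_BITS = 16.0           # bits per parameter in the original checkpoint
NVFP4_BITS = 4 + 8 / 16    # assumed: 4-bit value + 8-bit scale per 16 values
REPORTED_RATIO = 3.06      # reduction factor quoted in the article


def size_gb(params: float, bits_per_param: float) -> float:
    """Storage in gigabytes (10^9 bytes) for `params` values."""
    return params * bits_per_param / 8 / 1e9


def compression_ratio(quantized_fraction: float) -> float:
    """BF16 size divided by mixed-precision size when only a fraction
    of the parameters is stored in NVFP4 and the rest stays in BF16."""
    avg_bits = (quantized_fraction * NVFP4_BITS
                + (1 - quantized_fraction) * BF16_BITS)
    return BF16_BITS / avg_bits


def quantized_fraction_for(ratio: float) -> float:
    """Invert compression_ratio: which fraction yields this ratio?"""
    return (BF16_BITS - BF16_BITS / ratio) / (BF16_BITS - NVFP4_BITS)


bf16_gb = size_gb(PARAMS, BF16_BITS)
nvfp4_gb = bf16_gb / REPORTED_RATIO
fraction = quantized_fraction_for(REPORTED_RATIO)

print(f"BF16 checkpoint:      {bf16_gb:.1f} GB")
print(f"At 3.06x reduction:   {nvfp4_gb:.1f} GB")
print(f"Ceiling if all NVFP4: {compression_ratio(1.0):.2f}x")
print(f"Implied NVFP4 share:  {fraction:.1%}")
```

Under these assumptions the reduction falls short of a naive 4x for two reasons: the per-block scales add overhead, and the layers left unquantized stay at 16 bits.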
Benchmark results show that the accuracy loss from quantization is minimal on the tasks reported. On MMLU Pro, the NVFP4 version scores 85.0 versus 85.6 in BF16; GPQA Diamond drops from 84.9 to 84.8; and AIME 2025 remains flat at 62.0. In some categories, such as IFBench and MMMU Pro, the quantized version actually outperforms the original. These results suggest that aggressive 4-bit quantization can be near-lossless for a large language model of this kind, enabling cost-effective deployment in production environments where GPU memory is the bottleneck. The model is available on Hugging Face for developers to test and integrate.
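The "more than 99%" framing can be checked directly from the three score pairs quoted above. This small sketch computes the share of the BF16 score that the NVFP4 version retains on each benchmark; the dictionary layout and function name are illustrative choices.

```python
# Scores quoted in the article, as (BF16, NVFP4) pairs.
scores = {
    "MMLU Pro": (85.6, 85.0),
    "GPQA Diamond": (84.9, 84.8),
    "AIME 2025": (62.0, 62.0),
}


def retention(bf16: float, nvfp4: float) -> float:
    """Percentage of the BF16 score kept after quantization."""
    return 100.0 * nvfp4 / bf16


retained = {name: retention(*pair) for name, pair in scores.items()}

for name, pct in retained.items():
    print(f"{name}: {pct:.2f}% of the BF16 score retained")

# The benchmark with the largest relative drop.
worst = min(retained, key=retained.get)
print(f"Largest relative drop: {worst}")
```

Of the three, MMLU Pro shows the largest relative drop, and it still retains about 99.3% of the BF16 score.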
- Quantizes a 35B-parameter MoE model from 16-bit BF16 to 4-bit NVFP4, reducing memory and disk usage by ~3.06x
- Accuracy drops are negligible: MMLU Pro goes from 85.6 to 85.0, and AIME 2025 is unchanged at 62.0
- NVIDIA's Model Optimizer and vLLM inference support make deployment straightforward on limited hardware
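For readers who want to try the vLLM path, a minimal offline-inference sketch might look like the following. The repository ID is an assumption (the announcement says only that the model is on Hugging Face), and the sketch assumes an installed vLLM build that can read the checkpoint's quantization format from its configuration and a GPU that supports it. `generate_once` is a hypothetical helper, not part of vLLM.

```python
# Hypothetical repository ID: check the actual model card before use.
MODEL_ID = "nvidia/Qwen3.6-35B-A3B-NVFP4"


def generate_once(prompt: str, max_tokens: int = 128) -> str:
    """Load the checkpoint with vLLM and return one completion."""
    # Imported lazily so the module can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    # Assumes vLLM picks up the quantization scheme from the checkpoint.
    llm = LLM(model=MODEL_ID)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `generate_once("Explain 4-bit quantization in one sentence.")` would download the weights on first use, so expect a long initial load.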
Why It Matters
The smaller footprint could bring a high-performing 35B-parameter model within reach of high-memory consumer GPUs, broadening access to advanced AI.