Quantization-Aware Training in TorchAO (II)
New QAT flow recovers up to 71.6% of quantization accuracy loss and integrates with Unsloth and Axolotl for faster fine-tuning.
Meta's PyTorch team has significantly expanded TorchAO's Quantization-Aware Training (QAT) capabilities in version 0.16.0, extending it beyond edge-device targets to fast CUDA kernels for server inference. The framework now integrates directly with the popular fine-tuning tools Unsloth and Axolotl, enabling developers to recover much of the accuracy that aggressive post-training quantization would otherwise sacrifice. Models can therefore be compressed to lower bit-widths such as INT4 and NVFP4 while maintaining performance, a critical advance for deploying large language models efficiently.
The technical breakthroughs include recovering 66.9% of the accuracy degradation with INT4 QAT on Gemma3-4B while achieving 1.73x inference speedups over BF16 baselines. On cutting-edge hardware, NVFP4 QAT on B200 GPUs recovers 71.6% of the accuracy loss while using only a quarter of the HBM. The update also introduces PARQ, a prototype optimizer-based technique for 3-bit quantization that matches 4-bit accuracy at 58% of the memory footprint with 1.57x faster decoding. Combined with LoRA, QAT speeds up training by 1.89x and reduces memory by 36.1%, making high-performance model compression accessible through simple API calls.
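Those API calls follow torchao's prepare/fine-tune/convert pattern via `quantize_`. The snippet below is a minimal sketch of that flow for INT8-activation/INT4-weight QAT; the toy model and training loop are placeholders, and class names such as `QATConfig` and `Int8DynamicActivationInt4WeightConfig` follow recent torchao releases but may differ slightly between versions.

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig
from torchao.quantization.qat import QATConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model; in practice this is a Hugging Face / Unsloth model.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(device)

# Target scheme: INT8 dynamic activations, INT4 grouped weights.
base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)

# Step 1: prepare. Insert fake-quantize ops so training sees quantization error.
quantize_(model, QATConfig(base_config, step="prepare"))

# Step 2: fine-tune as usual (full fine-tuning or LoRA); dummy loop shown here.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for _ in range(10):
    x = torch.randn(8, 1024, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Step 3: convert. Replace fake-quantize ops with real low-bit weights for inference.
quantize_(model, QATConfig(base_config, step="convert"))
```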
- INT4 QAT with Unsloth recovers 66.9% of the accuracy loss for Gemma3-4B, delivering a 1.73x inference speedup
- NVFP4 QAT prototype with Axolotl recovers 71.6% of the accuracy degradation while using a quarter of the HBM on B200 GPUs (sketched after this list)
- New PARQ technique enables 3-bit models with accuracy matching 4-bit baselines at a 58% memory footprint and 1.57x higher decoding throughput
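The NVFP4 path follows the same two-step flow, only with an NVFP4 base config. The sketch below assumes the prototype `NVFP4InferenceConfig` exposed under torchao's mx_formats prototype module (the import path and class name are assumptions and may change, as the feature is a prototype), and the fast kernels require Blackwell-class hardware such as the B200.

```python
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import QATConfig
# Prototype NVFP4 config; module and class name assumed from torchao's prototype mx_formats.
from torchao.prototype.mx_formats import NVFP4InferenceConfig

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda().bfloat16()

# Same prepare/convert pattern as INT4 QAT, but targeting NVFP4.
nvfp4_config = NVFP4InferenceConfig()
quantize_(model, QATConfig(nvfp4_config, step="prepare"))
# ... fine-tune here, e.g. through Axolotl's QAT integration ...
quantize_(model, QATConfig(nvfp4_config, step="convert"))
```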
Why It Matters
Enables production deployment of high-accuracy LLMs on resource-constrained devices and servers, dramatically reducing inference costs.