Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]
After FP16 and pruning, 162MB model needs INT8/INT4 or distillation for further gains.
A developer optimizing a transformer-based model for inference speed and size has hit a plateau: converting weights to FP16 (a 2x size reduction) and exporting to ONNX for inference with ONNX Runtime helped, but subsequent attempts at unstructured pruning, structured pruning, and ONNX graph optimizations failed to yield significant gains, leaving the model at ~162MB. The developer is now weighing next steps: low-rank factorization (SVD/LoRA-style compression), more aggressive quantization (INT8/INT4 via GPTQ, AWQ, or SmoothQuant), knowledge distillation into a smaller student model, or hardware-specific optimizations (TensorRT, FlashAttention).
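As a concrete starting point for the quantization route, here is a minimal sketch of post-training dynamic INT8 quantization with ONNX Runtime. The file names are placeholders rather than the developer's actual artifacts, dynamic quantization is normally applied to the FP32 ONNX export rather than the FP16 one, and the quantized model should be re-benchmarked for accuracy and latency afterwards.

```python
# Minimal sketch: post-training dynamic INT8 quantization with ONNX Runtime.
# "model_fp32.onnx" / "model_int8.onnx" are placeholder paths, not the
# developer's actual files.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # existing ONNX export
    model_output="model_int8.onnx",  # result with weights stored as INT8
    weight_type=QuantType.QInt8,     # QUInt8 is also supported
)
```

Dynamic quantization only stores weights in INT8 and quantizes activations on the fly; static quantization with a calibration set, or GPTQ/AWQ for INT4, can cut size further but requires more tooling.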
Community feedback suggests that once a model is trained, low-rank methods (SVD/LoRA) rarely help without retraining, and that quantization and distillation are the primary paths forward. INT8/INT4 quantization techniques such as GPTQ and AWQ can reduce model size by 50-75% with minimal accuracy loss, while knowledge distillation trains a smaller student model to mimic the larger one. Hardware optimizations such as TensorRT and FlashAttention can further improve inference speed but depend on the target deployment environment. The developer is advised to prioritize quantization or distillation for real-world gains, with low-rank methods only considered if the model is retrained from scratch.
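For the distillation route, the loss below is a minimal sketch of the standard soft-target formulation (temperature-softened KL divergence plus cross-entropy on hard labels); the temperature and mixing weight are illustrative defaults, not values from the discussion.

```python
# Minimal sketch of a Hinton-style knowledge-distillation loss.
# T (temperature) and alpha (soft/hard mixing weight) are illustrative defaults.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```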
- FP16 conversion and ONNX Runtime gave only 2x size reduction, with pruning and graph optimizations offering no further gains.
- INT8/INT4 quantization (GPTQ, AWQ, SmoothQuant) can reduce model size by 50-75% with minimal accuracy loss.
- Knowledge distillation trains a smaller student model and is often more effective post-training than low-rank factorization (SVD/LoRA); a sketch of the SVD approach follows this list.
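If retraining were on the table, the low-rank route would amount to replacing weight matrices with truncated-SVD factors and then fine-tuning. The sketch below is illustrative only: the rank (r=64) is hypothetical, and without fine-tuning this typically costs accuracy, which is why the feedback above ranks it behind quantization and distillation.

```python
# Minimal sketch: replace one nn.Linear with a rank-r factorization via truncated SVD.
# The rank r=64 is a hypothetical choice; fine-tuning afterwards is usually required.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, r: int = 64) -> nn.Sequential:
    W = layer.weight.data                                # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep the top-r singular values: W ≈ (U[:, :r] * S[:r]) @ Vh[:r, :]
    first = nn.Linear(layer.in_features, r, bias=False)
    second = nn.Linear(r, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:r, :]                        # (r, in_features)
    second.weight.data = U[:, :r] * S[:r]                # (out_features, r)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)
```

The factorized pair stores r * (in_features + out_features) weights instead of in_features * out_features, so the saving only materializes when r is well below both dimensions.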
Why It Matters
Optimizing transformer models beyond FP16 is crucial for deploying AI on edge devices with limited memory and compute.