Research & Papers

INT8 quantization gives me better accuracy than FP16! [D]

A developer finds INT8-quantized models outperforming FP16 in accuracy

Deep Dive

A developer working with deep learning models exported to ONNX observed a surprising result: INT8 post-training quantization delivered higher inference accuracy than FP16, with FP32 as the baseline. This contradicts the usual expectation that the lower-precision INT8 format would degrade accuracy more than FP16, which is numerically closer to FP32. The user, Fragrant_Rate_2583, shared the findings on Reddit, sparking discussion about possible causes. One explanation is that quantization acts as a regularizer, introducing noise that helps the model generalize slightly better on the evaluation set. Another is that the FP16 conversion itself introduces rounding errors or numerical instability in certain layers, since FP16's narrower dynamic range can overflow or underflow where FP32 and calibrated INT8 scaling do not. The specific model architecture and dataset may also simply tolerate INT8's uniform, calibrated scaling better than FP16's reduced precision. Commenters noted that results vary widely with the model, quantization method, and hardware, and that no universal rule applies. The case highlights the importance of empirical testing in model optimization.
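
For readers who want to reproduce this kind of comparison, below is a minimal sketch of how the two lower-precision variants are commonly produced from an ONNX model. The original post does not specify its tooling or quantization mode; the sketch assumes the standard onnxruntime dynamic post-training quantization path and the onnxconverter-common FP16 converter, and the filenames ("model_fp32.onnx", etc.) are placeholders.

```python
# Sketch: create FP16 and INT8 variants of an FP32 ONNX model for an
# accuracy comparison. Filenames are placeholders, not from the post.
import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import quantize_dynamic, QuantType

# FP16 conversion: casts weights and activations to half precision.
# Layers whose values exceed FP16's dynamic range can overflow here,
# which is one of the suspected causes of the accuracy drop.
fp32_model = onnx.load("model_fp32.onnx")
fp16_model = float16.convert_float_to_float16(fp32_model)
onnx.save(fp16_model, "model_fp16.onnx")

# INT8 dynamic post-training quantization: weights are stored as int8
# with per-tensor scale factors; activations are quantized at runtime.
# (Static PTQ with a calibration data reader is the other common path.)
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```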

Key Points
  • INT8 post-training quantization outperformed FP16 in accuracy on an ONNX-exported model
  • FP32 remained the most accurate baseline, but INT8 beat FP16 unexpectedly
  • Potential causes include quantization acting as a regularizer or FP16 numerical instability

Why It Matters

Challenges assumptions about precision trade-offs, urging developers to validate quantized models empirically rather than assuming FP16 will always beat INT8.
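
As a closing illustration of that empirical check, here is a small evaluation sketch that runs the FP32, FP16, and INT8 variants over the same labelled batch and compares top-1 accuracy. The filenames, input shape, and the random placeholder data are assumptions for illustration only.

```python
# Sketch: compare top-1 accuracy of FP32/FP16/INT8 ONNX models on one batch.
import numpy as np
import onnxruntime as ort

def top1_accuracy(model_path: str, inputs: np.ndarray, labels: np.ndarray) -> float:
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    # FP16 graphs expect float16 inputs; match the declared input dtype.
    if inp.type == "tensor(float16)":
        inputs = inputs.astype(np.float16)
    logits = sess.run(None, {inp.name: inputs})[0]
    return float((logits.argmax(axis=1) == labels).mean())

# Placeholder batch: (N, C, H, W) images and integer class labels.
inputs = np.random.rand(8, 3, 224, 224).astype(np.float32)
labels = np.zeros(8, dtype=np.int64)

for path in ["model_fp32.onnx", "model_fp16.onnx", "model_int8.onnx"]:
    print(path, top1_accuracy(path, inputs, labels))
```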