Audio & Speech

Whisper-small quantization: 57% smaller, slightly better accuracy for edge deployment

Dynamic int8 with Quanto cuts model size without retraining or accuracy loss.

Deep Dive

Large speech recognition models like OpenAI's Whisper-small are powerful but too computationally heavy for edge devices. A new study from Arthur Söhler, Julian Irigoyen, and Andreas Søeborg Kirkedal, accepted at the SPEAKABLE workshop (LREC 2026), systematically evaluates post-training quantization (PTQ), which reduces numerical precision without retraining, as a way to shrink Whisper-small while preserving accuracy. The team tested four quantization libraries (PyTorch, Optimum-Quanto, HQQ, and bitsandbytes) across different schemes (dynamic and static), methods, granularities, and bit-widths, evaluating each on the LibriSpeech test sets.
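
To make "reducing numerical precision" concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It is an illustration of the general idea only, not the paper's code or any library's implementation; the function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.88, -0.51]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each code fits in one byte instead of the four bytes of a 32-bit float, which is where the size savings come from for the tensors that are quantized. The sketch does not attempt to reproduce the study's measured compression figures.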

Their key finding: dynamic int8 quantization with Optimum-Quanto offers the best trade-off, cutting model size by 57% while slightly lowering word error rate (WER) relative to the unquantized baseline. Static quantization performed worse, likely due to Whisper's Transformer architecture. More aggressive formats like nf4 and int3 achieved up to 71% compression but showed notable accuracy degradation on the noisier test-other set. The paper provides a clear roadmap for selecting PTQ configurations, enabling efficient Whisper-small deployment on resource-constrained hardware without the cost of retraining.
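
As a rough illustration of how the dynamic and static schemes differ in general, the sketch below quantizes activations with a scale fixed from a calibration batch (static) versus a scale computed from each input at run time (dynamic). The batches and helper names are hypothetical, and this is not offered as the paper's explanation for why static quantization underperformed on Whisper.

```python
def scale_for(values):
    """Symmetric int8 scale covering the largest magnitude in `values`."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def round_trip_error(values, scale):
    """Worst-case error after quantizing to int8 codes and dequantizing."""
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return max(abs(v - c * scale) for v, c in zip(values, codes))


# Static: the scale is fixed once from a (hypothetical) calibration batch.
calibration_batch = [0.5, -1.0, 0.25, 0.75]
static_scale = scale_for(calibration_batch)

# A (hypothetical) run-time input with a wider range than calibration saw.
runtime_batch = [0.5, -4.0, 2.0, 0.1]

# Static clips values outside the calibrated range; dynamic rescales per input.
static_error = round_trip_error(runtime_batch, static_scale)
dynamic_error = round_trip_error(runtime_batch, scale_for(runtime_batch))
print(static_error, dynamic_error)
```

In this toy case the static scale clips the out-of-range values and produces a large error, while the dynamic scale keeps the error within half a quantization step at the cost of computing a scale for every input.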

Key Points
  • Dynamic int8 quantization with Optimum-Quanto reduces Whisper-small size by 57% while slightly lowering word error rate.
  • Static quantization underperforms dynamic quantization, likely because of Whisper's Transformer architecture.
  • Aggressive formats (nf4, int3) achieve up to 71% compression but lose accuracy on noisy speech (test-other).

Why It Matters

Post-training quantization lets Whisper-small run efficiently on edge devices, opening up on-device speech recognition without cloud dependency or the cost of retraining.