Quantization and Fast Inference (MEAP) - How much performance are you actually getting from quantization in production? [D]
New MEAP book dives into PTQ, QAT, and sub-8-bit quirks for real-world inference.
Manning Publications' new MEAP (early access) book 'Quantization and Fast Inference' by Kalyan Aranganathan tackles the practical challenges of making ML inference faster and cheaper. It starts with quantization fundamentals and progresses through post-training quantization (PTQ), quantization-aware training (QAT), runtime packaging, and the trade-offs that matter in production. The book goes beyond basic INT8 coverage, diving into activation outliers in LLMs, KV cache pressure, fake quantization workflows, straight-through estimators, and why sub-8-bit formats behave differently in real inference workloads than in published results.
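For anyone who hasn't touched these workflows, here is a minimal PyTorch sketch (illustrative, not taken from the book) of two of the ideas named above: symmetric per-tensor INT8 PTQ, and QAT-style fake quantization with a straight-through estimator (STE). The function names and the max-abs scale choice are assumptions for the demo, not the author's method.

```python
# Illustrative sketch: symmetric per-tensor INT8 PTQ and an STE-based
# fake-quantization forward pass. Names and scale policy are assumptions.
import torch

def ptq_int8(w: torch.Tensor):
    """PTQ: derive a scale from the observed range, round to int8,
    and keep the scale around for dequantization at inference time."""
    scale = w.abs().max() / 127.0                       # symmetric range [-127, 127]
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

def fake_quant_ste(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """QAT-style fake quantization: simulate rounding error in the forward
    pass while letting gradients flow through unchanged (the STE).
    `x + (q - x).detach()` equals q in the forward pass, but its gradient
    with respect to x is the identity."""
    q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    return x + (q - x).detach()

w = torch.randn(4, 4)
q, s = ptq_int8(w)
print("max abs dequantization error:", (dequantize(q, s) - w).abs().max().item())
```

The `detach` trick is the standard way to express an STE in autograd frameworks: the rounded value is used forward, while backward treats quantization as the identity, which is what makes QAT trainable despite the non-differentiable rounding step.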
The manuscript balances theoretical derivations with operational questions like memory bandwidth, latency, and deployment cost. As a MEAP release, the book evolves chapter by chapter, and readers get access as it develops. The author emphasizes production constraints rather than benchmark perfection, making it valuable for engineers dealing with hardware-specific behaviors, tooling fragmentation, or accuracy collapse. A limited-time 50% discount code (MLKALYANARANGAN50RE) is available for the community.
- Covers quantization techniques from PTQ and QAT to sub-8-bit formats and their production quirks.
- Addresses pain points like activation outliers in LLMs, KV cache pressure, and memory bandwidth constraints (a rough sizing sketch follows this list).
- Offers a 50% discount code, plus free copies in exchange for sharing real-world quantization experiences.
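To make "KV cache pressure" concrete, here is a back-of-envelope Python sketch (mine, not the book's) of how cache size scales; the Llama-2-7B-like shape in the example is an assumption for illustration.

```python
# Back-of-envelope KV cache sizing. The cache grows linearly with batch,
# sequence length, layer count, and bytes per element, which is why
# dropping from FP16 to INT8 halves the memory pressure.
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x accounts for the separate key and value tensors per layer.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128.
fp16 = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128, bytes_per_elem=2)
int8 = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128, bytes_per_elem=1)
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB, INT8: {int8 / 2**30:.1f} GiB")
```

With these assumed numbers the FP16 cache alone is 16 GiB versus 8 GiB at INT8, which is the kind of deployment arithmetic the book frames its trade-offs around.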
Why It Matters
Enables ML teams to slash inference costs and latency without hardware overhauls.