FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
Ternary weights replace multiplications with additions, hitting 32.4 tokens/s on a single Xeon.
A new inference system called FairyFuse, detailed in a paper on arXiv, enables multiplication-free execution of large language models on commodity CPUs. The key idea is to restrict weights to the ternary values -1, 0, and +1, so that each computationally expensive floating-point multiplication becomes a simple conditional addition, subtraction, or no-op. FairyFuse fuses the eight real-valued sub-GEMV operations of each linear layer into a single AVX-512 loop built from masked additions and subtractions, executing with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts the memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup, while offering little benefit on GPUs.
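As a rough illustration of how masked ternary accumulation avoids multiplications, here is a minimal dot-product sketch using AVX-512 masked add/sub intrinsics. The weight layout (one "+1" bitmask and one "-1" bitmask per group of 16 activations) is an assumption made for clarity, not FairyFuse's actual packed format or fused kernel.

```c
// Minimal sketch: multiplication-free ternary dot product with AVX-512.
// Assumed layout (hypothetical): for every 16 activations, one bitmask marks
// the +1 weights and another marks the -1 weights; zero weights match neither.
#include <immintrin.h>
#include <stddef.h>

float ternary_dot_avx512(const float *x,          /* n activations, n % 16 == 0 */
                         const __mmask16 *plus,   /* n/16 masks for +1 weights  */
                         const __mmask16 *minus,  /* n/16 masks for -1 weights  */
                         size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 xv = _mm512_loadu_ps(x + i);
        // +1 weights: add the selected activation lanes to the accumulator.
        acc = _mm512_mask_add_ps(acc, plus[i / 16], acc, xv);
        // -1 weights: subtract the selected activation lanes.
        acc = _mm512_mask_sub_ps(acc, minus[i / 16], acc, xv);
        // 0 weights fall through untouched -- no multiply anywhere.
    }
    return _mm512_reduce_add_ps(acc);  // horizontal sum of the 16 partial sums
}
```

This sketch handles a single dot product; per the paper's description, FairyFuse instead fuses the eight sub-GEMVs of a linear layer into one such loop so the activations are streamed only once.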
End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp's Q4_K_M by 1.24x with near-lossless quality: WikiText-2 perplexity rises only from 5.47 (FP16) to 5.52, and average downstream accuracy matches FP16 at 66.0%. This demonstrates that ternary quantization combined with fused kernel execution can deliver significant performance gains on CPU-only platforms without sacrificing quality. The authors note that the approach is most effective on bandwidth-limited CPUs, where memory bandwidth is the primary bottleneck for autoregressive generation, and offers limited benefit on GPUs.
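To see why weight compression moves GEMV along the roofline, a back-of-the-envelope arithmetic-intensity estimate helps; the byte counts below (2 bytes per FP16 weight, roughly 1 bit per packed ternary weight) are illustrative assumptions consistent with the reported 16x compression, not figures taken from the paper.

```latex
% Arithmetic intensity of an M x N GEMV, ignoring activation traffic:
% ~2MN ops against MN * b_w bytes of streamed weights.
\mathrm{AI} \approx \frac{2MN}{MN \cdot b_w} = \frac{2}{b_w}
\qquad\Longrightarrow\qquad
\frac{\mathrm{AI}_{\text{ternary}}}{\mathrm{AI}_{\text{FP16}}}
\approx \frac{b_{w,\mathrm{FP16}}}{b_{w,\mathrm{ternary}}}
\approx \frac{16\ \text{bits}}{1\ \text{bit}} = 16
```

That roughly 16x rise in operations per byte is what the roofline argument above describes as shifting the decode-time GEMV out of the memory-bound regime on bandwidth-limited CPUs.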
- FairyFuse eliminates floating-point multiplications by using ternary weights (-1, 0, +1) and fused AVX-512 kernels.
- Achieves 29.6x kernel speedup and 32.4 tokens/second on a single Intel Xeon 8558P.
- Outperforms llama.cpp Q4_K_M by 1.24x with near-lossless quality (perplexity 5.52 vs 5.47 FP16).
Why It Matters
Makes high-quality LLM inference viable on CPU-only servers, reducing hardware costs and enabling broader deployment without GPUs.