MXNorm: Reusing MXFP block scales for efficient tensor normalisation
New technique reuses existing block scales to shrink the normalization reduction by 32x, speeding up Llama 3 training.
A team of researchers from DeepMind and the University of Edinburgh has introduced MXNorm, a method for accelerating the training of large language models such as Llama 3. The core idea is to reuse the block scales (per-block scaling factors computed when casting activations to the efficient MXFP8 format for matrix multiplication) to estimate the root mean square (RMS) needed for layer normalization. This removes the separate, expensive reduction that RMSNorm would otherwise perform, shrinking the reduction by a factor of 32, the MXFP8 block size. The result is a direct attack on a growing performance mismatch: matrix multiplication has become extremely fast in low-precision formats, while auxiliary operations such as normalization remain a bottleneck.
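The preprint's exact estimator is not given in this article, but the idea can be sketched in a few lines of PyTorch. Everything below (the `mxfp8_block_scales` and `approx_rms_from_scales` helpers, the FP8 E4M3 maximum of 448, and the uncalibrated estimate) is an illustrative assumption rather than the authors' code; the point is simply that the RMS is recovered from the N/32 block scales the MXFP8 cast already produces, instead of a fresh reduction over all N activations.

```python
import torch

BLOCK = 32        # MXFP8 block size (OCP microscaling spec)
FP8_MAX = 448.0   # largest magnitude representable in FP8 E4M3

def mxfp8_block_scales(x: torch.Tensor) -> torch.Tensor:
    # Per-block power-of-two scales of the kind an MXFP8 cast kernel already
    # emits as a by-product (hypothetical helper, for one token's activations).
    amax = x.reshape(-1, BLOCK).abs().amax(dim=-1)
    return torch.exp2(torch.floor(torch.log2(amax / FP8_MAX)))

def approx_rms_from_scales(scales: torch.Tensor, calib: float = 1.0) -> torch.Tensor:
    # Estimate the tensor's RMS from the N/32 scales instead of all N values.
    # `calib` stands in for whatever correction the paper derives; left at 1.0
    # here, so this sketch over-estimates the true RMS.
    return calib * FP8_MAX * scales.float().pow(2).mean().sqrt()

x = torch.randn(4096)                    # one token's activations
scales = mxfp8_block_scales(x)           # 128 scales instead of 4096 elements
rms_est = approx_rms_from_scales(scales)
rms_ref = x.pow(2).mean().sqrt()         # what RMSNorm would compute directly
print(float(rms_est), float(rms_ref))
```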
MXNorm acts as a drop-in replacement for the standard RMSNorm layer, requiring no architectural changes. The team validated it by pre-training Llama 3 models at 125M, 1B, and 8B parameter scales, finding that accuracy nearly matched a baseline trained with full-precision RMSNorm. In standalone kernel benchmarks, MXNorm ran up to 2.4x faster than RMSNorm, which translated to a 1.3% end-to-end speedup for full transformer layers in MXFP8 and 2.6% with the even lower-precision NVFP4 format. The work, detailed in a new arXiv preprint, demonstrates that optimizing 'overhead' operations is now critical for squeezing maximum performance from modern AI accelerators.
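To picture what 'drop-in' means, the sketch below wraps the same block-scale estimate in a module with the interface of torch.nn.RMSNorm. The class name MXNormSketch, the estimator, and the missing calibration constant are all assumptions based on the article's description, not the released implementation.

```python
import torch
import torch.nn as nn

BLOCK, FP8_MAX = 32, 448.0  # MXFP8 block size and FP8 E4M3 max magnitude

class MXNormSketch(nn.Module):
    # Hypothetical drop-in replacement for RMSNorm over the last dimension.
    # The per-token RMS is estimated from per-32-element block scales of the
    # kind an MXFP8 cast already produces, instead of a full-width reduction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        assert dim % BLOCK == 0
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        blocks = x.reshape(*x.shape[:-1], -1, BLOCK)           # (..., dim/32, 32)
        scales = torch.exp2(torch.floor(torch.log2(
            blocks.abs().amax(dim=-1) / FP8_MAX)))             # (..., dim/32)
        # Assumed estimator: treat FP8_MAX * scale as a proxy for each block's
        # magnitude; the paper's calibration is not reproduced here.
        rms = FP8_MAX * scales.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return self.weight * x / (rms + self.eps)

# Usage: a one-line swap in an existing model definition.
layer = MXNormSketch(4096)      # in place of nn.RMSNorm(4096)
y = layer(torch.randn(2, 16, 4096))
```

In a real kernel the scales would fall out of the MXFP8 cast rather than being recomputed as they are here, which is where the saving comes from.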
- Reuses MXFP8 block scales to estimate RMS, cutting normalization reduction size by 32x.
- Validated on Llama 3 models (125M to 8B) with minimal training accuracy loss.
- Delivers up to a 2.4x kernel speedup over RMSNorm, with layer-level speedups of 1.3% in MXFP8 and 2.6% in NVFP4.
Why It Matters
Directly accelerates AI training and inference by optimizing a critical bottleneck, saving time and compute costs for companies running large models.