MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks
A perplexity bug affecting up to 38% of quantized MiniMax-M2.7 uploads, traced to an overflow in llama.cpp.
The AI optimization team Unsloth has identified and fixed a critical bug causing NaN (Not a Number) errors in perplexity evaluations for quantized versions of the MiniMax-M2.7 model. Their investigation revealed the issue was widespread, affecting 21% to 38% of all MiniMax-M2.7 GGUF files uploaded to Hugging Face by various community groups. The root cause was traced to an overflow in the popular llama.cpp inference library, triggered when processing medium-sized quantization types such as Q4_K and Q5_K at two specific model blocks (32 and 311). Interestingly, lower-bit quantizations like Q2_K were unaffected.
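To see why a single bad block poisons an entire evaluation, recall that perplexity is the exponential of the average negative log-likelihood over all tokens, so one NaN logit anywhere in the sequence makes the final score NaN. The sketch below illustrates that propagation with hypothetical NumPy values; it is not Unsloth's or llama.cpp's actual evaluation code.

```python
import numpy as np

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 32_000))   # 128 tokens over a hypothetical 32k vocab
targets = rng.integers(0, 32_000, size=128)
print(perplexity(logits, targets))        # a finite number

# One corrupted activation in a single block is enough:
logits[64, 0] = np.nan                    # e.g. from inf - inf somewhere upstream
print(perplexity(logits, targets))        # nan -- the whole evaluation is poisoned
```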
Unsloth has now updated its own repository with corrected GGUF files, though the precise mathematical trigger for the overflow remains unknown. The findings highlight a significant quality-control challenge in the open-source AI model ecosystem, where quantization, a process that shrinks models for local deployment, can introduce subtle, hard-to-detect bugs. In a related but separate issue, NVIDIA is investigating problems with CUDA 13.2, which more than 50 users have confirmed causes gibberish outputs from some low-bit quant models across various architectures; the issue is fixable by downgrading to CUDA 13.1.
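For users unsure whether the CUDA workaround applies to their setup, the installed toolkit version can be read from nvcc. A minimal sketch (Python; assumes nvcc is on the PATH):

```python
import re
import subprocess

# Read the installed CUDA toolkit version from `nvcc --version`.
out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
match = re.search(r"release (\d+)\.(\d+)", out)
if match is None:
    print("Could not parse nvcc output.")
else:
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) == (13, 2):
        print("CUDA 13.2 detected: affected by the gibberish-output bug; "
              "consider rolling back to CUDA 13.1.")
    else:
        print(f"CUDA {major}.{minor} detected; not the affected release.")
```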
- A NaN bug affected 38% (10/26 files) of one popular MiniMax-M2.7 GGUF repository and 21% of Unsloth's own uploads before being fixed.
- The issue was an overflow in llama.cpp, specifically impacting medium-sized quants (Q4_K, Q5_K) while sparing both higher and lower precision versions (see the sketch of the overflow-to-NaN mechanism after this list).
- NVIDIA is separately investigating a confirmed CUDA 13.2 bug causing corrupt outputs for low-bit quant models, advising a rollback to CUDA 13.1 as a workaround.
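The precise arithmetic trigger remains unidentified, but the general failure mode is well understood: dequantization reconstructs each weight from a quantized integer plus a per-block scale and offset, and if any intermediate value exceeds the range of a 16-bit float it becomes inf, after which operations like inf / inf or inf - inf yield NaN. A minimal NumPy sketch of that mechanism (illustrative layout and values, not llama.cpp's actual K-quant kernels):

```python
import numpy as np

# Dequantize a 4-bit block: w = d * q - m, with a per-block scale d and
# offset m held in float16 (illustrative layout, not the exact K-quant format).
q = np.arange(16, dtype=np.float16)          # 4-bit codes 0..15
d = np.float16(8.0)                          # block scale
m = np.float16(4.0)                          # block offset
w = d * q - m                                # fine: all values well within fp16 range

# An overflow anywhere upstream turns a finite value into inf...
x = np.float16(60000.0) * np.float16(2.0)    # exceeds fp16 max (~65504)
print(x)                                     # inf

# ...and inf routinely collapses to NaN in later ops,
# e.g. a softmax-style normalization:
row = np.array([x, x], dtype=np.float16)
print(row / row.sum())                       # [nan, nan]: inf / inf is undefined
```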
Why It Matters
Ensures reliable local deployment of quantized models for developers and highlights how critical bugs can lurk in widely used AI infrastructure.