MLX vs GGUF (Unsloth) - Qwen3.5 122B-10B
A new benchmark shows Apple's MLX format is dramatically faster than GGUF for running a massive 122B-parameter model on Apple Silicon Macs.
A viral benchmark has revealed a significant performance gap between two popular model formats on Apple Silicon Macs. A user tested the 122-billion-parameter Qwen3.5 model, comparing the mlx-community 6-bit version against a GGUF (Unsloth) 5-bit quant (Q5_K_XL) on an M4 Max with 128GB of RAM. In summarization tasks with 80k- and 120k-token contexts, the MLX format was decisively faster, achieving 34.7 tokens per second versus GGUF's 15.8 tokens per second, a 2.2x speed advantage, and it used roughly 5.6GB less peak memory. Time to first token was also drastically lower for MLX (110.9 seconds versus 253.9 seconds in the 80k-context test), making it far more responsive.
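For readers who want to sanity-check this kind of throughput figure on their own hardware, here is a minimal sketch using the mlx-lm Python package. The model repository id, input file, and generation length are placeholder assumptions, not the exact setup from the benchmark, and time-to-first-token is not measured here since that would require the streaming API.

```python
# Minimal throughput sketch with the mlx-lm package (pip install mlx-lm).
# The repo id and input file below are placeholders; substitute whichever
# mlx-community quantized build and long document you actually want to test.
import time

from mlx_lm import load, generate

MODEL_REPO = "mlx-community/placeholder-6bit-model"  # placeholder, not a real repo id

model, tokenizer = load(MODEL_REPO)

prompt = "Summarize the following document:\n" + open("long_document.txt").read()

start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=512)
elapsed = time.perf_counter() - start

# Rough generated-token throughput; a single blocking generate() call cannot
# separate prompt processing from decoding, so this is an end-to-end number.
generated_tokens = len(tokenizer.encode(output))
print(f"{generated_tokens} tokens in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.1f} tok/s)")
```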
Beyond raw speed, the test included a practical coding challenge: implementing a 'Browser OS.' Both quantizations produced functional and nearly indistinguishable code, though the GGUF output required a manual correction for a browser compatibility issue, hinting at a possible quality edge for MLX. The results strongly indicate that for Mac users, the MLX format, which is native to Apple's MLX framework and optimized for Apple Silicon's Metal GPU and unified memory, is now the clear choice over the more universal GGUF format for large models. This benchmark could accelerate a shift in the local LLM ecosystem, pushing developers and users toward native Apple Silicon frameworks for the best performance and potentially marginalizing GGUF in high-end Mac workflows.
- MLX ran the 122B Qwen3.5 model 2.2x faster than GGUF on an M4 Max (34.7 vs 15.8 tokens/sec).
- MLX used ~5.6GB less peak memory (95.5GB vs 101.1GB) and cut time-to-first-token by more than half (110.9s vs 253.9s at 80k context).
- In a practical coding test, both quantizations produced similar output, but the GGUF version required a manual fix, suggesting MLX offers quality parity or better.
Why It Matters
For professionals running large AI models locally on Macs, switching to the MLX format can roughly double inference speed and reduce memory pressure on long-context workloads.
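As a rough illustration of what that switch looks like in code, the sketch below runs the same prompt through llama-cpp-python (GGUF) and through mlx-lm (MLX). The model file name, repository id, and context size are placeholder assumptions, not the exact builds from the benchmark.

```python
# Hypothetical before/after sketch: the same prompt served from a GGUF file
# via llama-cpp-python versus an MLX quantization via mlx-lm.

# --- GGUF path (llama.cpp bindings) ---
from llama_cpp import Llama

# Placeholder file name; n_ctx sized for a long-context summarization workload.
gguf_llm = Llama(model_path="model-q5_k_xl.gguf", n_ctx=131072)
gguf_out = gguf_llm("Summarize: ...", max_tokens=256)["choices"][0]["text"]

# --- MLX path (mlx-lm) ---
from mlx_lm import load, generate

# Placeholder repo id; any mlx-community quantized build follows the same call.
mlx_model, mlx_tokenizer = load("mlx-community/placeholder-6bit-model")
mlx_out = generate(mlx_model, mlx_tokenizer, prompt="Summarize: ...", max_tokens=256)
```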