MLX is not faster for me. I benchmarked MLX vs llama.cpp on an M1 Max across four real workloads, and effective tokens/s tells a very different story from the headline numbers. What am I missing? I'd appreciate help with benchmarks and an M2 through M5 comparison.
Real-world tests show MLX's advertised 2x speed advantage disappears with longer context windows due to slow prefill.
A developer's viral benchmark analysis reveals significant discrepancies between advertised and real-world performance of Apple's MLX framework for running local LLMs on Macs. Testing a Qwen3.5-35B-A3B model on an M1 Max 64GB Mac Studio, the developer found that MLX showed an impressive 57 tokens/second generation speed in LM Studio's UI, compared with 29 tokens/s for the same model in GGUF format via llama.cpp. However, when measuring complete response times from prompt submission to final token output (what users actually experience), the results flipped dramatically.
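To make the distinction concrete, here is a minimal sketch of how end-to-end effective throughput relates to prefill and generation speeds. This is not MLX or llama.cpp code, and the prefill speed of 200 tokens/s and the output length of 500 tokens are illustrative assumptions, not figures from the benchmark:

```python
def effective_tps(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, gen_tps: float) -> float:
    """End-to-end tokens/s as a user experiences it: output tokens
    divided by (prefill time + generation time)."""
    prefill_time = prompt_tokens / prefill_tps
    gen_time = output_tokens / gen_tps
    return output_tokens / (prefill_time + gen_time)

# Short prompt: prefill is negligible, so effective speed is close to
# the advertised generation speed.
print(effective_tps(prompt_tokens=100, output_tokens=500,
                    prefill_tps=200, gen_tps=57))   # ~54 tok/s

# Long prompt: prefill dominates total time and effective speed collapses,
# no matter how fast raw generation is.
print(effective_tps(prompt_tokens=8496, output_tokens=500,
                    prefill_tps=200, gen_tps=57))   # ~9.8 tok/s
```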
At 8,496 tokens of context, MLX's effective speed plummeted to just 3 tokens/second, matching GGUF performance, because prefill (processing the entire input before generation begins) consumed 94% of total response time. This makes MLX genuinely fast for short-context creative tasks but misleading as a general benchmark: for practical workloads like document classification or multi-turn agent conversations, it is the slower option. The developer's data shows GGUF consistently outperforming MLX at context lengths above 1,500 tokens, with GGUF delivering 16 effective tokens/s to MLX's 10 at 1,453 tokens.
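As a quick sanity check, the reported figures are internally consistent: if prefill takes 94% of total response time and generation runs at 57 tokens/s, the effective rate falls out directly.

```python
# Back-of-envelope check on the article's numbers.
# If prefill takes 94% of total response time, generation gets the rest.
gen_tps = 57          # MLX generation speed reported in LM Studio's UI
gen_share = 1 - 0.94  # fraction of total time actually spent generating

# Output tokens = gen_tps * (gen_share * total_time), so
# effective tokens/s = output_tokens / total_time = gen_tps * gen_share.
effective = gen_tps * gen_share
print(f"{effective:.1f} tok/s")  # 3.4, matching the reported ~3 tok/s
```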
The findings highlight how memory bandwidth limitations on M1 Max impact prefill performance, raising questions about whether newer M2 through M5 chips with improved bandwidth might narrow the gap. The developer is now testing optimization parameters and comparing LM Studio with Ollama and bare llama.cpp implementations, inviting community participation for more comprehensive benchmarking across Apple Silicon generations.
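For anyone joining the benchmarking effort, a rough harness like the following measures what the developer measured: total time from prompt submission to final token, plus time to first token as a proxy for prefill cost. It targets a local OpenAI-compatible endpoint (LM Studio and Ollama both expose one); the URL, model identifier, and the one-token-per-chunk approximation are assumptions to adjust for your setup:

```python
import time
import requests

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server (adjust as needed)
MODEL = "local-model"                  # placeholder identifier for the loaded model

def measure(prompt: str, max_tokens: int = 256) -> dict:
    """Time the full request from submission to final token, using
    streaming so time to first token (prefill proxy) is visible."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    with requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "stream": True,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        },
        stream=True,
        timeout=600,
    ) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1  # rough: one SSE chunk is approximately one token
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": (first_token_at or start) - start,
        "total_s": total,
        "effective_tps": n_chunks / total if total > 0 else 0.0,
    }

# Compare a short prompt with a long, padded one to expose prefill cost.
print(measure("Classify: the sky is blue."))
print(measure("Classify this document: " + "lorem ipsum " * 2000))
```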
- MLX shows 57 tokens/s generation speed, but effective speed drops to 3 tokens/s at 8.5K context because prefill consumes 94% of response time
- GGUF models via llama.cpp outperform MLX for document classification and multi-turn conversations despite lower advertised generation speeds
- Memory bandwidth limitations on M1 Max significantly impact prefill performance, with unknown effects on newer M2-M5 Apple Silicon chips
Why It Matters
Developers choosing local LLM frameworks need realistic performance metrics, not misleading generation speeds that don't reflect real-world use cases.