Open Source

M5 Max vs M3 Max Inference Benchmarks (Qwen3.5, oMLX, 128GB, 40 GPU cores)

New benchmarks show the M5 Max generates AI text 1.4x to 2.9x faster, with massive gains at long context lengths.

Deep Dive

New performance data reveals Apple's latest M5 Max chip is a significant leap forward for on-device AI, particularly for developers running large language models. Benchmarks run on identical 16-inch MacBook Pros with 128GB of unified memory show the M5 Max consistently outperforming its M3 Max predecessor. Using the oMLX framework (v0.2.23) to test three variants of Alibaba's Qwen 3.5 model, the new chip delivered 1.4x to 1.7x faster token generation in standard tests with a 1,024-token prompt and 128 generated tokens. The most dramatic gains appear in long-context scenarios: at a 65K context length, the 27B dense model ran 2.9x faster on the M5 Max (19.6 vs. 6.8 tokens/second).
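The headline ratios follow directly from the raw throughput figures; a quick sanity check of the long-context number, using only the values quoted above:

```python
# Long-context decode throughput for the 27B dense model at 65K context,
# as reported in the benchmark.
m5_max_tps = 19.6  # tokens/second on M5 Max
m3_max_tps = 6.8   # tokens/second on M3 Max

speedup = m5_max_tps / m3_max_tps
print(f"{speedup:.1f}x")  # -> 2.9x
```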

These performance improvements are driven by the M5 Max's upgraded GPU Neural Accelerators and a substantial memory bandwidth increase from 400 GB/s to 614 GB/s. That bandwidth is critical for complex, multi-step agentic workflows where an AI needs to maintain context across many steps or make parallel tool calls. The benchmarks also highlight the efficiency of Mixture-of-Experts (MoE) models like Qwen 3.5 122B-A10B: despite its massive total size, it generates text faster than the smaller 27B dense model because only 10B parameters are active per token. For developers building agentic applications, the M5 Max's superior batching scalability—achieving 2.54x aggregate throughput at a 4x batch size, versus throughput degradation on the M3 Max—could be a game-changer for production workloads.
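A common back-of-the-envelope model (an illustration, not part of the benchmark) explains why active parameters, not total size, set the pace: single-stream decoding is memory-bandwidth-bound, since every active parameter must be streamed from memory once per generated token. A sketch under that assumption, using 2 bytes per parameter (roughly FP16/BF16; quantized weights would be proportionally faster):

```python
def est_tokens_per_sec(bandwidth_gb_s: float,
                       active_params_billions: float,
                       bytes_per_param: float = 2.0) -> float:
    """Rough bandwidth-bound decode estimate: every active parameter is
    read from unified memory once per generated token."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# M5 Max (614 GB/s): a 122B MoE with 10B active params beats a 27B dense model,
# because the MoE touches far fewer bytes per token.
print(est_tokens_per_sec(614, 10))  # MoE:   ~30.7 tok/s upper bound
print(est_tokens_per_sec(614, 27))  # dense: ~11.4 tok/s upper bound
```

These are theoretical ceilings, not predictions of the measured numbers, but they capture the ordering the benchmark observed. The same bandwidth-bound logic explains why batching helps: weight reads are amortized across requests, and the reported 2.54x aggregate throughput at a 4x batch size corresponds to roughly 64% scaling efficiency.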

Key Points
  • The M5 Max generates AI text 1.4x to 2.9x faster than the M3 Max, with the largest gains (up to 4x in prefill) at long context lengths.
  • A 53% memory bandwidth boost (614 GB/s vs. 400 GB/s) significantly accelerates multi-step agent loops and parallel tool calls.
  • Mixture-of-Experts (MoE) models like Qwen 3.5 122B-A10B run faster than smaller dense models, as speed depends on active parameters, not total size.

Why It Matters

For developers building on-device AI agents, the M5 Max enables faster, more complex reasoning and tool use without relying on cloud APIs.