Open Source

[Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback)

New benchmarks reveal that the M5 Max's 128GB of unified memory enables 2,845 tokens/sec prompt processing, a major leap over previous Apple Silicon generations.

Deep Dive

A new round of detailed benchmarks for Apple's flagship M5 Max chip with 128GB of unified memory shows how well the platform is suited to local AI workloads. The testing, conducted with the llama.cpp and MLX frameworks, found the biggest leap in prompt processing (PP) speed: the rate at which the system ingests long context. On a quantized 35B-A3B Mixture-of-Experts (MoE) model, the M5 Max achieved 2,845 tokens per second on a 512-token prompt, 5.5 times faster than a standard dense 27B model at the same Q6_K quantization level. This performance is enabled by the chip's 614 GB/s memory bandwidth and the allocation of the full 128GB memory pool to the GPU.
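
As a rough way to reproduce a PP measurement like this, the sketch below times prompt ingestion through the llama-cpp-python bindings. It is a minimal sketch under assumptions: the GGUF path is a hypothetical placeholder, the filler prompt is not the benchmark's, and results will not match a proper llama-bench run exactly.

```python
# Minimal sketch: timing prompt processing (PP) with llama-cpp-python.
# MODEL_PATH is a hypothetical placeholder, not a file from the benchmark post.
import time

from llama_cpp import Llama

MODEL_PATH = "models/moe-35b-a3b-q6_k.gguf"  # hypothetical local GGUF file

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,  # offload every layer to the Metal GPU
    n_ctx=4096,       # room for the test prompt
    verbose=False,
)

# Build a roughly 512-token prompt from filler text (retokenizing the
# detokenized text may shift the count slightly; this is only a sketch).
tokens = llm.tokenize(("The quick brown fox jumps over the lazy dog. " * 100).encode())[:512]
prompt = llm.detokenize(tokens).decode(errors="ignore")

start = time.perf_counter()
llm(prompt, max_tokens=1)  # max_tokens=1 keeps the run dominated by prompt ingestion
elapsed = time.perf_counter() - start

print(f"PP throughput: ~{len(tokens) / elapsed:.0f} tok/s over ~{len(tokens)} tokens")
```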

For token generation (TG), the speed of producing new text, the system is more bandwidth-bound than compute-bound, with the top-performing 35B-A3B MoE model generating 92.2 tokens/sec. The benchmarks also provided a corrected, apples-to-apples comparison between Apple's native MLX framework and the popular llama.cpp: at equivalent 4-bit quantization, MLX was found to be 30% faster at token generation than llama.cpp on the same hardware, highlighting the benefit of Apple's optimized software stack. The data solidifies the M5 Max as a top-tier consumer platform for developers and researchers who need to run and experiment with large language models like Qwen 3.5 122B and Gemma 3 27B entirely on-device.
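
On the MLX side, a comparable wall-clock check of TG speed can be sketched with the mlx-lm Python package. The Hugging Face repo id below is a hypothetical placeholder for any 4-bit MLX-converted model, and this crude timing also includes prompt prefill; passing verbose=True would instead print mlx_lm's own prompt and generation tokens/sec statistics.

```python
# Minimal sketch: measuring token-generation (TG) speed with mlx-lm.
# The repo id is a hypothetical placeholder for any 4-bit MLX-converted model.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/some-model-4bit")  # hypothetical repo id

prompt = "Explain unified memory in one paragraph."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Rough throughput; elapsed also covers prompt prefill, small for a short prompt.
n_tokens = len(tokenizer.encode(text))
print(f"TG throughput: ~{n_tokens / elapsed:.1f} tok/s")
```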

Key Points
  • The M5 Max with 128GB RAM hits 2,845 tok/sec prompt processing on a 35B-A3B MoE model, a 5.5x speedup over dense models.
  • Apple's native MLX framework is 30% faster for token generation than llama.cpp at equivalent 4-bit quantization on the same hardware.
  • The system's 614 GB/s memory bandwidth allows it to efficiently run massive models like the 122B-parameter Qwen 3.5 MoE locally; a back-of-envelope bandwidth-ceiling sketch follows below.
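
As a back-of-envelope check on the bandwidth-bound claim, token generation is capped at roughly the memory bandwidth divided by the bytes of weights streamed per token (active parameters times bytes per weight). The sketch below uses an assumed ~0.82 bytes per weight for Q6_K; that figure is an estimate, not a number from the benchmark.

```python
# Back-of-envelope TG ceiling: tok/s <= bandwidth / (active_params * bytes_per_weight).
# The bytes-per-weight value is a rough assumption for Q6_K (~6.6 bits/weight).

BANDWIDTH_GBS = 614  # M5 Max unified-memory bandwidth, GB/s

models = {
    # name: (active params, billions; assumed bytes per weight)
    "35B-A3B MoE, Q6_K": (3.0, 0.82),   # only ~3B parameters active per token
    "27B dense, Q6_K":   (27.0, 0.82),  # every weight is read for every token
}

for name, (active_b, bytes_per_w) in models.items():
    gb_per_token = active_b * bytes_per_w  # GB of weights streamed per token
    ceiling = BANDWIDTH_GBS / gb_per_token  # idealized upper bound
    print(f"{name}: ~{gb_per_token:.1f} GB/token -> ceiling ~{ceiling:.0f} tok/s")
```

The measured 92.2 tok/s for the MoE model sits well below its idealized weight-streaming ceiling, which is expected once KV-cache reads, activations, and kernel overheads are counted; the takeaway is that sparse activation, not raw compute, is what puts the MoE so far ahead of the dense 27B.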

Why It Matters

This performance makes high-end Macs serious contenders for local AI development and deployment, reducing reliance on cloud APIs for complex inference tasks.