Open Source

M5 Max: Actual Prefill Performance Gains

New tests suggest Apple's 4x AI compute claim depends on brief power bursts, with gains peaking at 16K-token prompts.

Deep Dive

A viral analysis by Reddit user M5_Maxxx has dissected Apple's marketing claim that the new M5 Pro and M5 Max chips deliver "over 4x the peak GPU compute for AI" compared to the M4 generation. The investigation suggests the dramatic gain isn't from architectural efficiency alone: a significant portion comes from the system aggressively allocating extra power for short durations, letting the Neural Accelerators and GPU cores briefly exceed their sustained power envelope. That creates a peak-performance window well suited to quick tasks, but one that is unlikely to hold under continuous load.

Testing indicates the performance sweet spot aligns perfectly with Apple's disclosed benchmark: prompts of around 16,000 tokens. The user's thermal testing, with 10-second cooldowns between inferences, highlights the bursty nature of the performance. For longer, continuous AI workloads—like generating extensive documents or running complex agentic loops—the speed advantages are expected to diminish as the system hits thermal and power limits, reverting to a more conservative performance profile. This reveals a strategic design choice by Apple, optimizing for the quick, on-device AI interactions typical of consumer use rather than marathon computational sessions.
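The methodology described above, timing inferences with cooldown pauses between them so each run starts from a cool, full-power-budget state, can be sketched as a small harness. This is a minimal illustration, not the Reddit user's actual script; `generate` is a hypothetical callable that yields tokens for a prompt, and the run count is arbitrary.

```python
import time
from statistics import median

def measure_ttft(generate, prompt, runs=5, cooldown_s=10.0):
    """Measure time-to-first-token (TTFT) over several runs.

    A cooldown between inferences lets the chip's thermal and power
    budget recover, so each measurement captures the burst-mode peak
    rather than a throttled steady state.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        next(iter(generate(prompt)))  # block until the first token arrives
        timings.append(time.perf_counter() - start)
        time.sleep(cooldown_s)  # let thermals and power budget reset
    return median(timings)
```

Dropping the cooldown (`cooldown_s=0`) and running back-to-back would instead expose the sustained profile the analysis says kicks in on longer workloads.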

Key Points
  • Apple's 4x AI compute claim for M5 Max relies on short-term power bursts, not sustained architectural gains.
  • Peak performance is optimized for ~16K token prompts, matching Apple's own test parameters for 'time to first token'.
  • For longer AI tasks, speed gains will likely diminish due to thermal and power constraints.

Why It Matters

Developers must design on-device AI apps for bursty inference, as sustained performance may not match peak benchmarks.
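One way to design for this, sketched here as an illustrative pattern rather than any Apple-documented API, is to batch inference work into short bursts separated by idle gaps so each burst can land in the peak power window. The `process` callable and the burst/cooldown durations are hypothetical placeholders, not measured values.

```python
import time

def run_in_bursts(work_items, process, burst_s=5.0, cooldown_s=10.0):
    """Process inference tasks in short bursts with cooldown gaps.

    After roughly `burst_s` seconds of continuous work, pause for
    `cooldown_s` seconds so the next burst starts with full thermal
    headroom instead of a throttled sustained-power profile.
    """
    results = []
    burst_start = time.perf_counter()
    for item in work_items:
        results.append(process(item))
        if time.perf_counter() - burst_start >= burst_s:
            time.sleep(cooldown_s)  # yield thermal headroom back
            burst_start = time.perf_counter()
    return results
```

For latency-sensitive interactive apps the gaps come for free between user actions; the pattern matters mainly for background batch jobs that would otherwise run the chip flat out.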