These are the benchmark results for Gemma4 E4B tested on my iPhone 16 Pro.
Memory bandwidth bottleneck revealed as decode stage lags far behind prefill...
Deep Dive
A Reddit user shared benchmark results showing a 10–20x throughput gap between the prefill and decode stages when moving from CPU to GPU, with decode lagging because AI inference at that stage is bound by memory bandwidth rather than compute. Data centers rely on high-bandwidth memory (HBM), and Korean manufacturers Samsung and SK Hynix are projected to earn a combined $340 billion in operating profit in 2024.
Key Points
- Gemma4 E4B on iPhone 16 Pro showed a 10–20x performance gap between prefill and decode stages when switching from CPU to GPU.
- Memory bandwidth, not compute, is identified as the primary bottleneck for AI inference, especially during the decode phase, where the full set of model weights must be read from memory for every generated token.
- Samsung and SK Hynix are projected to earn $340B combined operating profit in 2024, driven by HBM demand for AI workloads.
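Why decode is bandwidth-bound can be seen with a back-of-the-envelope roofline estimate: each decoded token requires streaming roughly the full set of active weights from memory, so tokens/sec is capped by bandwidth divided by model size. The sketch below uses illustrative assumed numbers (phone LPDDR bandwidth, HBM bandwidth, ~4B active parameters at 4-bit quantization), not figures from the benchmark.

```python
# Roofline-style upper bound on decode throughput:
# one full read of the active weights per generated token.
# All numbers here are illustrative assumptions, not measured values.

def decode_tokens_per_sec(bandwidth_gb_s: float,
                          active_params_billions: float,
                          bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on decode speed, in tokens per second."""
    model_bytes = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Assumed ~60 GB/s phone LPDDR bandwidth, ~4B active params, 4-bit weights.
phone_ceiling = decode_tokens_per_sec(60, 4, 0.5)    # -> 30 tok/s
# Assumed ~1200 GB/s for an HBM3e-class part, same model.
hbm_ceiling = decode_tokens_per_sec(1200, 4, 0.5)    # -> 600 tok/s

print(f"phone ceiling: {phone_ceiling:.0f} tok/s, "
      f"HBM ceiling: {hbm_ceiling:.0f} tok/s")
```

Note that prefill does not hit this ceiling: it processes many prompt tokens per weight read, so it stays compute-bound and benefits fully from the GPU, which is exactly the gap the benchmark exposes.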
Why It Matters
These results highlight that memory bandwidth, not compute, is the real bottleneck in AI inference, which benefits memory manufacturers like Samsung and SK Hynix.