LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load
iPhone 16 Pro loses nearly half its throughput in two iterations due to thermal throttling.
A new research paper from Pranay Tummalapalli and three co-authors provides a crucial reality check for deploying always-on AI agents on consumer devices. The team benchmarked a 4-bit quantized version of the Qwen 2.5 1.5B model across four distinct platforms: a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, a laptop NVIDIA RTX 4050 GPU, and a Raspberry Pi 5 equipped with a Hailo-10H Neural Processing Unit (NPU). The goal was to measure sustained performance—throughput, latency, power, and thermal behavior—under a repeated 258-token prompt, moving beyond short-burst benchmarks to simulate real-world agent use.
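The paper's harness isn't published in detail, but the sustained-load methodology it describes can be sketched as a simple loop: replay the same prompt and record throughput per iteration, so degradation over time becomes visible. This is a minimal illustration; `generate` is a hypothetical stand-in for whatever on-device runtime is under test, not the authors' actual code.

```python
import time

def benchmark_sustained(generate, prompt, iterations=10):
    """Run the same prompt repeatedly and record per-iteration throughput
    (tokens/s), exposing thermal throttling that one-shot benchmarks miss."""
    per_iteration = []
    for _ in range(iterations):
        start = time.perf_counter()
        tokens_out = generate(prompt)           # number of tokens produced
        elapsed = time.perf_counter() - start
        per_iteration.append(tokens_out / elapsed)
    return per_iteration

# Stub generator standing in for a real on-device LLM runtime.
def fake_generate(prompt):
    time.sleep(0.01)
    return 64

per_iter = benchmark_sustained(fake_generate, "x" * 258, iterations=3)
# Fractional throughput loss between first and last iteration; on the
# iPhone 16 Pro this figure reached ~0.5 after only two iterations.
degradation = 1 - per_iter[-1] / per_iter[0]
```

The key design point is measuring every iteration rather than averaging: a device that throttles hard still posts a respectable mean over a short run.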
The results expose stark platform-level constraints. For flagship smartphones, thermal management supersedes raw compute power as the primary bottleneck. The iPhone 16 Pro lost nearly half its throughput within just two inference iterations due to aggressive thermal throttling. The Samsung S24 Ultra faced a different software-hardware clash: its operating system enforced a hard GPU frequency floor that eventually terminated the inference process entirely. This highlights that mobile AI performance is dictated by a complex interplay of silicon, cooling, and system software.
On dedicated hardware, different limitations emerged. The laptop RTX 4050 GPU, while delivering the highest sustained throughput at 131.7 tokens per second, was constrained by its 34.1-watt power ceiling. In contrast, the Hailo-10H NPU on the Raspberry Pi delivered a consistent 6.9 tokens per second while consuming under 2 watts, showing near-zero performance variance. Remarkably, the Hailo system matched the RTX 4050 in energy proportionality (tokens per joule), achieving similar efficiency at 19x lower throughput. This makes specialized NPUs compelling for low-power, predictable edge applications.
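The energy-proportionality claim can be checked from the reported figures alone. Note the paper gives the Hailo-10H's power only as "under 2 watts," so its tokens-per-joule figure below is a lower bound:

```python
def tokens_per_joule(throughput_tok_s: float, power_w: float) -> float:
    """Energy efficiency: tokens generated per joule of energy consumed.
    (tok/s) / W = tok/J, since 1 W = 1 J/s."""
    return throughput_tok_s / power_w

rtx_4050  = tokens_per_joule(131.7, 34.1)  # ~3.9 tok/J
hailo_10h = tokens_per_joule(6.9, 2.0)     # ~3.5 tok/J, a lower bound
                                           # (2 W caps the NPU's power draw)

print(f"RTX 4050:  {rtx_4050:.2f} tok/J")
print(f"Hailo-10H: {hailo_10h:.2f} tok/J (lower bound)")
print(f"Throughput gap: {131.7 / 6.9:.1f}x")  # ~19x
```

The two platforms land within about 10% of each other on tokens per joule despite the ~19x throughput gap, which is what the paper means by matched energy proportionality.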
The study's critical takeaway is that deploying LLMs at the edge is less about theoretical FLOPs and more about system-level engineering. Performance is a product of hardware capability, power delivery, thermal design, and the software stack. For developers building on-device agents, this means choosing a platform requires understanding its sustained, not peak, behavior under load, as thermal and power limits will define the actual user experience.
Key Findings
- iPhone 16 Pro throughput dropped ~50% in two iterations due to thermal throttling, not lack of compute.
- Samsung S24 Ultra inference was terminated by an OS-enforced GPU frequency floor, a software-hardware constraint.
- Raspberry Pi 5 with Hailo-10H NPU ran at 6.9 tok/s under 2W, matching an RTX 4050's energy efficiency at 19x lower speed.
Why It Matters
Reveals that thermal and power limits, not chip specs, will define real-world performance for on-device AI assistants.