Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
NPUs and GPUs on single-board computers can now run LLMs efficiently...
A new arXiv paper from researchers at the University of Greenwich proposes a multi-dimensional benchmarking methodology for evaluating large language model (LLM) inference on hardware-accelerated single-board computers (SBCs). The study tests four IoT-suitable edge platform configurations, including the Raspberry Pi 5 with Hailo-8 NPU and the NVIDIA Jetson Orin with GPU, measuring token throughput, power efficiency, and physical device size. The authors note that existing edge benchmarking efforts rely on CPU-only inference and lack coverage of actual SBC platforms, leaving a gap for structured evaluation of NPU and GPU accelerators.
The results show that hardware accelerators such as NPUs and GPUs deliver substantial gains over CPU-only setups, enabling practical LLM deployment in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and ruggedized portable operations. The paper offers practical guidance for selecting SBC-accelerator combinations, addressing challenges around data privacy, latency, and cost that are especially acute in operational technology and defence settings. The work highlights the growing feasibility of running capable LLMs at the edge rather than relying on traditional cloud-centric deployment.
- Benchmarked four SBC configurations including Raspberry Pi 5 with Hailo-8 NPU and Jetson Orin with GPU
- Multi-dimensional evaluation covers token throughput, power efficiency, and physical device size
- NPUs and GPUs significantly outperform CPU-only setups for LLM inference at the edge
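The evaluation dimensions above can be sketched as simple derived metrics. The sketch below is illustrative only: the metric definitions (tokens per second, tokens per joule) are common conventions, not necessarily the paper's exact formulas, and the device figures are placeholder numbers, not the paper's measured results.

```python
# Illustrative sketch of per-device edge-LLM benchmark metrics.
# Numbers are made-up placeholders, NOT results from the paper.
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    name: str
    tokens_generated: int
    wall_time_s: float   # total inference wall-clock time
    avg_power_w: float   # mean power draw during the run
    volume_cm3: float    # physical device size

    @property
    def tokens_per_second(self) -> float:
        # throughput: generated tokens over elapsed time
        return self.tokens_generated / self.wall_time_s

    @property
    def tokens_per_joule(self) -> float:
        # power efficiency: energy (J) = avg power (W) * time (s)
        return self.tokens_generated / (self.avg_power_w * self.wall_time_s)

runs = [
    BenchmarkRun("CPU-only SBC (placeholder)", 512, 160.0, 7.5, 180.0),
    BenchmarkRun("SBC + NPU (placeholder)", 512, 45.0, 9.0, 200.0),
]
for r in runs:
    print(f"{r.name}: {r.tokens_per_second:.1f} tok/s, "
          f"{r.tokens_per_joule:.3f} tok/J, {r.volume_cm3:.0f} cm^3")
```

Reporting throughput and energy efficiency side by side with device volume makes accelerator trade-offs directly comparable across SBC configurations.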
Why It Matters
Enables private, low-latency LLM deployment on edge devices for defence and IoT applications.