Benchmarking Local Language Models for Social Robots using Edge Devices
25 LLMs tested on a Raspberry Pi; the best-balanced model runs at 2.5 tokens/sec.
Social-educational robots like the Robot Study Companion (RSC) demand responsive, privacy-preserving interaction, yet run on severely constrained compute such as the Raspberry Pi. Researchers from the University of Tartu and the University of Ljubljana systematically benchmarked open-source language models for such edge deployment. They tested 25 models across three dimensions: inference efficiency (tokens per second and energy consumption), general knowledge (an MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated by five human raters). The primary platform was a Raspberry Pi 4, with comparisons on a Pi 5 and a laptop GPU.
Results reveal stark trade-offs: throughput varies by over 10x, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Granite4 Tiny Hybrid (7B) emerged as the top all-rounder at 2.5 tokens/sec, 0.90 tokens/joule, and 54.6% MMLU, while models with higher MMLU scores did not necessarily teach better. Human validation confirmed the automated ranking (Pearson r=0.967). Based on these findings, the authors propose a three-tier local inference architecture that balances responsiveness and accuracy on resource-constrained hardware, enabling real-time, privacy-preserving educational robots.
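The two efficiency metrics are simple to reproduce: throughput is generated tokens divided by wall-clock time, and energy efficiency is generated tokens divided by joules consumed during inference. A minimal sketch, assuming the average power draw comes from an external meter (e.g. a USB power monitor); the numbers below are illustrative, not taken from the paper:

```python
def efficiency_metrics(num_tokens: int, elapsed_s: float, avg_power_w: float):
    """Compute the two efficiency metrics used in the benchmark:
    tokens/sec (throughput) and tokens/joule (energy efficiency)."""
    tokens_per_sec = num_tokens / elapsed_s
    joules = avg_power_w * elapsed_s          # energy = power * time
    tokens_per_joule = num_tokens / joules
    return tokens_per_sec, tokens_per_joule

# Illustrative numbers: 250 tokens in 100 s at ~2.8 W average draw.
tps, tpj = efficiency_metrics(250, 100.0, 2.8)
print(f"{tps:.2f} tok/s, {tpj:.2f} tok/J")   # 2.50 tok/s, 0.89 tok/J
```

Reporting tokens/joule alongside tokens/sec matters on battery-powered robots, since a model that is fast but power-hungry can still drain the platform faster than a slower, more frugal one.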
- 25 open-source LLMs benchmarked on Raspberry Pi 4/5 for social robots like the RSC
- Granite4 Tiny Hybrid (7B) offers the best balance: 2.5 tokens/sec, 0.90 tokens/joule, 54.6% MMLU
- Teaching effectiveness does not require high MMLU; human raters validate the automated ranking (r=0.967)
Why It Matters
Enables privacy-preserving, low-cost AI tutors on edge devices, democratizing adaptive education.