DIY AI Server on Xiaomi 12 Pro Benchmarks llama.cpp vs Google's LiteRT
A Reddit user (u/Aromatic_Ad_7557) has completely redesigned their headless AI server, built on a Xiaomi 12 Pro with a Snapdragon 8 Gen 1, and now runs Gemma-4-E4B on it via both llama.cpp and Google’s LiteRT. The new build features a custom cooling system: a copper heatsink with a fan sits on the back, and the front screen was removed so the phone could be mounted directly onto an aluminum plate with two fans and a thermal pad. The power supply was rebuilt from the battery’s BMS, with a capacitor, two fuses, a crowbar circuit set at 4.3V, and a backup fan. The entire assembly is housed in a 3D-printed case on an aluminum extrusion stand.
Benchmarks for Gemma-4-E4B showed llama.cpp reaching 30.6 tokens per second (t/s) during prompt processing and 5.7 t/s during generation, with light CPU load and low current draw. LiteRT was slightly faster in generation, but it pegged all CPU cores and drew noticeably more power. GPU acceleration proved impossible: Google AI Edge has no APK for Snapdragon 8 Gen 1, and swapping Qualcomm library files failed. A Vulkan build of llama.cpp also ran into issues. The user concludes that the project is a fun DIY success if you have a spare phone, but that a mini PC is the easier choice for practical LLM serving.
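To put the reported llama.cpp figures in practical terms, the rough sketch below estimates how long one request would take at 30.6 t/s for prompt processing and 5.7 t/s for generation. The function name and the example token counts are illustrative assumptions, not part of the original post, and the model assumes constant throughput, ignoring thermal throttling, context-length effects, and model load time.

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prompt_tps: float = 30.6, gen_tps: float = 5.7) -> float:
    """Estimate end-to-end request time in seconds.

    Defaults are the llama.cpp throughput figures reported for Gemma-4-E4B
    on the Xiaomi 12 Pro; constant throughput is a simplifying assumption.
    """
    if prompt_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput values must be positive")
    prefill_s = prompt_tokens / prompt_tps   # time to process the prompt
    decode_s = output_tokens / gen_tps       # time to generate the reply
    return prefill_s + decode_s


# Hypothetical request: 512-token prompt, 256-token reply.
total = estimate_latency_s(512, 256)
print(f"Estimated time: {total:.1f} s")
```

Under these assumptions, a 512-token prompt with a 256-token reply would take roughly a minute, most of it spent in generation, which is consistent with the user's view that the build is a fun project rather than a practical serving setup.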