Jetson Orin NX hits 14 tok/s running Hermes Agent with Gemma 4
This tiny edge device now runs a 26B LLM at usable speeds, and does it silently.
A Reddit user took a Jetson Orin NX, left over from a robotics project abandoned in the Llama-7B era, and turned it into a compact, silent LLM inference machine for Hermes Agent. To keep it silent at the now-standard 40W power profile, they cut down the stock heatsink with a hacksaw and 3D-printed a new case. The goals were more than 10 tok/s text generation, 300 tok/s prompt processing, and a context window of at least 65K, all while looking cool.
After extensive benchmarking of models such as Gemma 4 26B and Qwen 3.6 across multiple quantizations, the best result came from Gemma 4 26B A4B UD Q2_K_XL: a 66K context window, with 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K context. The system still handles multiple tool calls well with long prompts, which suggests that edge hardware can now run capable agentic AI locally without depending on the cloud.
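As a rough sanity check on the numbers above, here is a minimal Python sketch that compares the reported generation speeds against the stated goals and works out how much throughput drops between ~8K and ~60K context. The helper names, the 500-token reply length, and the reading of "K" as 1,000 tokens are illustrative assumptions, not details from the original post.

```python
# Figures reported in the write-up; everything else here is illustrative.
TARGET_GEN_TOK_S = 10.0   # goal: more than 10 tok/s generation
TARGET_CONTEXT = 65_000   # goal: at least 65K context (assuming K = 1,000)

measured_tok_s = {
    8_000: 14.65,    # tok/s at ~8K context
    60_000: 10.21,   # tok/s at ~60K context
}
achieved_context = 66_000  # reported 66K context window (assuming K = 1,000)


def slowdown_pct(fast: float, slow: float) -> float:
    """Percentage drop in throughput going from `fast` to `slow`."""
    return (1 - slow / fast) * 100


def seconds_for_reply(n_tokens: int, tok_s: float) -> float:
    """Time to generate `n_tokens` at a steady `tok_s` rate."""
    return n_tokens / tok_s


meets_speed = all(rate > TARGET_GEN_TOK_S for rate in measured_tok_s.values())
meets_context = achieved_context >= TARGET_CONTEXT
drop = slowdown_pct(measured_tok_s[8_000], measured_tok_s[60_000])

print(f"Generation goal met at both context sizes: {meets_speed}")
print(f"Context goal met: {meets_context}")
print(f"Throughput drop from ~8K to ~60K context: {drop:.1f}%")

# Hypothetical 500-token reply, to make the rates more tangible.
REPLY_TOKENS = 500
for ctx, rate in measured_tok_s.items():
    secs = seconds_for_reply(REPLY_TOKENS, rate)
    print(f"{REPLY_TOKENS} tokens at ~{ctx // 1000}K context: {secs:.1f} s")
```

On these figures, generation slows by roughly 30% as the context fills, while still clearing the 10 tok/s goal at ~60K.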
- Custom hardware mods: stock heatsink cut down with a hacksaw, plus a 3D-printed case, for silent 40W operation
- Best model: Gemma 4 26B A4B UD Q2_K_XL reached 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K context
- Context window: 66K, enough for complex multi-tool agent tasks on edge hardware
Why It Matters
Local AI agents on small edge devices are now viable, reducing cloud costs and latency for real-time tasks.