Open Source

Jetson Orin NX hits 14 tok/s running Hermes Agent with Gemma 4

This tiny edge device now runs a 26B-parameter LLM at usable speeds, and does so silently.

Deep Dive

A Reddit user took a Jetson Orin NX that had been gathering dust since a robotics project abandoned in the Llama-7B era and turned it into a compact, silent LLM inference machine for Hermes Agent. To run silently at the now-standard 40W power profile, they cut down the stock heatsink with a hacksaw and 3D-printed a new case. The targets were more than 10 tok/s text generation, 300 tok/s prompt processing, and a context window of at least 65K tokens, all while looking cool.

After extensive benchmarking of models including Gemma 4 26B and Qwen 3.6 across multiple quantizations, the best result came from Gemma 4 26B A4B UD Q2_K_XL: a 66K context window, with generation at 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K context. The system still handles multiple tool calls well with long prompts, a sign that edge hardware can now run capable agentic AI locally without cloud dependency.
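To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses the two reported generation speeds, but the prompt-processing rate is only the stated 300 tok/s goal (no measured value is given here), and the turn size (8,000 prompt tokens, 300-token reply) is a made-up example. It also ignores prompt caching, so treat the result as an upper-bound estimate rather than a benchmark.

```python
# Generation speeds reported for Gemma 4 26B A4B UD Q2_K_XL.
TG_AT_8K = 14.65   # tok/s at ~8K context
TG_AT_60K = 10.21  # tok/s at ~60K context

# Assumption: the 300 tok/s prompt-processing *goal*, not a measured figure.
PP_GOAL = 300.0


def slowdown_pct(fast: float, slow: float) -> float:
    """Relative drop in throughput from `fast` to `slow`, in percent."""
    return (fast - slow) / fast * 100.0


def turn_seconds(prompt_tokens: int, reply_tokens: int, pp: float, tg: float) -> float:
    """Rough wall-clock time for one agent turn: prompt processing plus generation."""
    return prompt_tokens / pp + reply_tokens / tg


if __name__ == "__main__":
    print(f"Slowdown from ~8K to ~60K context: {slowdown_pct(TG_AT_8K, TG_AT_60K):.1f}%")
    # Hypothetical turn: 8,000 uncached prompt tokens and a 300-token reply.
    print(f"Estimated turn time: {turn_seconds(8000, 300, PP_GOAL, TG_AT_8K):.0f} s")
```

On these assumptions, generation slows by roughly 30% between the two context sizes, and a single uncached turn of that size takes a little under a minute.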

Key Points
  • Custom hardware mods: hacksaw on stock heatsink and 3D-printed case for silent 40W operation
  • Best model: Gemma 4 26B A4B UD Q2_K_XL achieved 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K context
  • Achieved 66K context window, enabling complex multi-tool agent tasks on edge hardware

Why It Matters

Local AI agents on small edge devices are becoming viable, which can reduce cloud costs and latency for real-time tasks.