Open Source

Jetson Orin NX hits 14 tok/s running Hermes Agent with Gemma 4

r/LocalLLaMA June 09, 2026

⚡ This tiny edge device now runs a 26B LLM at usable speeds, and does it silently.

Deep Dive

A Reddit user took a Jetson Orin NX that had been gathering dust since a robotics project abandoned in the Llama-7B era and turned it into a compact, silent LLM inference machine for Hermes Agent. To run silently at the now-standard 40W power profile, they cut down the stock heatsink with a hacksaw and 3D-printed a new case. The targets were more than 10 tok/s for text generation, 300 tok/s for prompt processing, and a context window of at least 65K tokens, all while looking cool.

After extensive benchmarking of models such as Gemma 4 26B and Qwen 3.6 across multiple quantizations, the best result came from Gemma 4 26B A4B UD Q2_K_XL: a 66K context window, with generation at 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K context. The system reportedly still handles multiple tool calls well with long prompts, which suggests that edge hardware can now run capable agentic AI locally, without cloud dependency.
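To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers reported above; the 500-token reply length is an arbitrary illustration, and the prompt-processing rate is the stated 300 tok/s target rather than a measured result. It also assumes a constant rate throughout, which is a simplification.

```python
# Figures reported in the summary; everything else is illustrative.
GEN_TOK_S_AT_8K = 14.65      # generation speed at ~8K context
GEN_TOK_S_AT_60K = 10.21     # generation speed at ~60K context
TARGET_GEN_TOK_S = 10.0      # stated goal: over 10 tok/s
TARGET_PROMPT_TOK_S = 300.0  # stated goal only, not a measured result


def slowdown_pct(fast: float, slow: float) -> float:
    """Percentage drop in throughput going from `fast` to `slow`."""
    return (fast - slow) / fast * 100.0


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to process `tokens` at a constant rate (a simplification)."""
    return tokens / tok_per_s


if __name__ == "__main__":
    drop = slowdown_pct(GEN_TOK_S_AT_8K, GEN_TOK_S_AT_60K)
    print(f"Throughput drop from ~8K to ~60K context: {drop:.1f}%")

    reply_tokens = 500  # hypothetical reply length, chosen for illustration
    for label, rate in (("~8K", GEN_TOK_S_AT_8K), ("~60K", GEN_TOK_S_AT_60K)):
        secs = seconds_for(reply_tokens, rate)
        print(f"{reply_tokens}-token reply at {label} context: {secs:.0f} s")

    # Assumes the 300 tok/s target is met; the summary gives no measurement.
    prefill = seconds_for(60_000, TARGET_PROMPT_TOK_S)
    print(f"Processing a 60K-token prompt at the target rate: {prefill:.0f} s")

    meets_goal = GEN_TOK_S_AT_60K > TARGET_GEN_TOK_S
    print(f"Generation goal met at ~60K context: {meets_goal}")
```

Generation slows by roughly 30% between the two context sizes, yet stays just above the 10 tok/s goal even near the top of the context window.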

Key Points

Custom hardware mods: hacksaw on stock heatsink and 3D-printed case for silent 40W operation
Best model: Gemma 4 26B A4B UD Q2_K_XL achieved 14.65 tok/s at ~8K context and 10.21 tok/s at ~60K
Achieved 66K context window, enabling complex multi-tool agent tasks on edge hardware

Why It Matters

Local AI agents on small edge devices are now viable, reducing cloud costs and latency for real-time tasks.
