Open Source

Gemma 4 agent architecture uses modular 'apps' to boost small model accuracy on local hardware

r/LocalLLaMA July 05, 2026

⚡ A custom framework replaces 20+ tools with two persistent apps, reducing URL errors on Gemma 4 models.

Deep Dive

A developer experimenting with local AI agents identified a core frustration: small language models frequently mangle exact strings such as URLs or part numbers when they have to choose among too many generic tools. Their solution was to replace a sprawling set of more than 20 tools with just two dedicated 'applications' (workflows), a text web browser and a PC control app, each presenting the agent with a scoped, menu-driven interface. The web browsing app, for example, shows numbered links and accepts commands like "open 1" or "copy 2" instead of forcing the model to generate raw URLs. Each app maintains its own persistent context, so the agent can leave and return without resetting state. While the agent is outside an app, only a brief reference to the app’s existence remains in main memory, along with a tool to re-enter it.
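The numbered-menu idea can be sketched in a few lines of Python. This is a minimal illustration and not the author's code: the `BrowserApp` and `Agent` classes, the method names, and the `example.com` URLs are all hypothetical. The point it shows is that exact URLs stay inside the app, the model only ever emits a short command with a number, and the app's state survives when the agent leaves and re-enters.

```python
from dataclasses import dataclass, field


@dataclass
class BrowserApp:
    """Hypothetical text-browser app: the model picks numbers, never types URLs."""
    links: list[tuple[str, str]] = field(default_factory=list)  # (label, exact URL)
    clipboard: str = ""

    def show(self, links: list[tuple[str, str]]) -> str:
        """Store the links found on a page and return the menu for the model."""
        self.links = list(links)
        return self.menu()

    def menu(self) -> str:
        """Render labels only; the URLs themselves never enter the prompt."""
        return "\n".join(
            f"{i}. {label}" for i, (label, _) in enumerate(self.links, start=1)
        )

    def handle(self, command: str) -> str:
        """Accept 'open N' or 'copy N' and resolve N to the stored URL."""
        parts = command.split()
        if len(parts) != 2 or parts[0] not in ("open", "copy") or not parts[1].isdecimal():
            return "error: use 'open N' or 'copy N'"
        index = int(parts[1])
        if not 1 <= index <= len(self.links):
            return f"error: choose 1-{len(self.links)}"
        _, url = self.links[index - 1]
        if parts[0] == "copy":
            self.clipboard = url
            return f"copied link {index}"
        return f"opening {url}"


class Agent:
    """Hypothetical host: app objects live outside the prompt and persist."""

    def __init__(self) -> None:
        self.apps = {"browser": BrowserApp()}
        self.active: BrowserApp | None = None

    def enter(self, name: str) -> str:
        """Re-enter an app; its earlier state (links, clipboard) is still there."""
        self.active = self.apps[name]
        return self.active.menu()

    def leave(self) -> str:
        """Leave the app; only this one-line reference stays in main memory."""
        self.active = None
        return "apps available: " + ", ".join(self.apps)
```

In this sketch a malformed or out-of-range command returns an error string rather than raising, on the assumption that the model should see the error and retry with a valid number.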

The system was tested on a single AMD RX6600XT using llama.cpp’s Vulkan backend, running Gemma 4 at two sizes: 26B (Q4_K_XL) and a smaller E4B variant (also quantized). Despite the larger model’s potential, the E4B performed better under this workflow, likely because the structured apps reduced the burden of tool selection. Throughput reached 70-85 tokens/s for generation with multi-token prediction and 800 tokens/s for prefill at a 10k-token context. The design is open-ended: the agent can still access ~100 general tools when outside the apps, but for complex multi-step tasks (like finding rare car parts), the application approach dramatically improved reliability. The code is shared as a proof of concept for others building local agent frameworks.
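To put the reported throughput in terms of wall-clock time, here is a back-of-the-envelope calculation. It assumes the rates are constant, that the full 10k-token context is processed from scratch with nothing reused, and a hypothetical 200-token reply; none of these assumptions come from the original post, so the results are rough bounds and not measurements.

```python
def seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate."""
    return tokens / tokens_per_second


PREFILL_TPS = 800                    # reported prefill rate
GEN_TPS_LOW, GEN_TPS_HIGH = 70, 85   # reported generation range
CONTEXT_TOKENS = 10_000              # reported context size
REPLY_TOKENS = 200                   # assumption: length of one agent reply

prefill_s = seconds(CONTEXT_TOKENS, PREFILL_TPS)
reply_slow_s = seconds(REPLY_TOKENS, GEN_TPS_LOW)
reply_fast_s = seconds(REPLY_TOKENS, GEN_TPS_HIGH)

print(f"full-context prefill: {prefill_s:.1f} s")
print(f"200-token reply: {reply_fast_s:.1f}-{reply_slow_s:.1f} s")
```

Under these assumptions, processing a full context takes longer than generating a short reply, which is one reason keeping only a brief app reference in main memory matters on this class of hardware.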

Key Points

Replaced 20+ generic tools with 2 dedicated applications (text web browser and PC control) that use numbered menus to avoid model URL errors
Each application maintains persistent state, so the agent can leave and return without losing context; only a brief reference to the app stays in the agent's main memory
Runs on consumer AMD RX6600XT at 70-85 t/s (10k context) using Gemma 4 E4B Q4_K_XL; the smaller model outperformed the larger 26B in this setup

Why It Matters

Suggests that scoped, menu-driven tool design can let a small local model match or even beat a larger one on multi-step agent tasks, easing hardware requirements.
