AI benchmark reveals prefill accounts for 94-99% of agentic workload time, not token generation

r/LocalLLaMA July 05, 2026

⚡KV head count beats parameter count for long-context agentic tasks

Deep Dive

A comprehensive long-context benchmark compared 13 LLMs spanning dense, MoE, Mamba2 hybrid, and MLA architectures at context sizes from 512 to 131K tokens. Run on an RX 7900 XT with llama.cpp and three KV cache tiers, the test showed that prefill (prompt processing) accounts for 94-99% of total agentic task time when the output is short (~300 tokens). In that regime, token generation speed (tg128) is nearly irrelevant.
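The arithmetic behind that percentage can be sketched in a few lines. This is only an illustration: the 923 tokens/sec prefill rate and the ~300-token output come from the benchmark summary, while the 40 tokens/sec generation speed, the reading of "131K" as 131,072 tokens, and the function name `prefill_share` are assumptions made for the example.

```python
def prefill_share(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Return the fraction of wall-clock time spent in prefill."""
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    gen_s = output_tokens / gen_tps           # time to generate the reply
    return prefill_s / (prefill_s + gen_s)


# Prompt size and prefill rate follow the benchmark summary;
# gen_tps=40.0 is an ASSUMED generation speed, not a measured one.
share = prefill_share(
    prompt_tokens=131_072,
    output_tokens=300,
    prefill_tps=923.0,
    gen_tps=40.0,
)
print(f"prefill share of wall-clock time: {share:.1%}")
```

Under these assumptions prefill takes roughly 95% of the total. Even doubling the assumed generation speed only shaves a few seconds off a run that spends more than two minutes on the prompt.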

Top performers at 131K context included Trinity-Mini (923 tokens/sec), Granite-4.0-H-Small (875 tokens/sec), and Ornith-9B (873 tokens/sec). The study found that KV head count matters more than total parameter count: models with 8 KV heads and a head dimension of 128 (like Trinity-Mini) scaled better than larger dense models. Notably, GLM-4.7-Flash (MLA) crashed above 16K, and Devstral-24B couldn't complete the 131K run due to KV cache memory limits. The takeaway: for real-world agentic RAG and coding agents, optimize for prefill speed and KV cache memory, not raw parameter count or generation speed.
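To see why KV cache memory becomes the limit at long context, here is a rough estimate using the standard sizing formula for attention with a per-layer K and V tensor. The 8 KV heads and head dimension of 128 come from the summary; the layer count of 32, the 2-byte (f16) cache element, and the helper name `kv_cache_bytes` are assumptions for illustration. The formula does not describe Mamba2 layers or MLA, which store state differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Estimate KV cache size for standard (grouped-query) attention."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


GIB = 1024 ** 3

# n_layers=32 is an ASSUMED value; 8 KV heads x 128 dim follows the summary.
for ctx in (16_384, 65_536, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx:>7} tokens -> {size / GIB:.1f} GiB")
```

The estimate grows linearly with context length, so a configuration that fits comfortably at 16K can exhaust VRAM at 131K. A quantized cache lowers `bytes_per_elem` and shrinks the total proportionally.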

Key Points

Prefill (prompt processing) takes 94-99% of wall-clock time at 65K+ context, making tg128 (token generation) nearly irrelevant for short agentic outputs
Trinity-Mini (MoE 3B/26B) led prefill at 131K context with 923 tokens/sec, beating larger models due to its 8 KV head architecture
KV cache size and head count matter more than parameter count: models with fewer KV heads (e.g., 4) saw severe slowdowns at high context, while Mamba2 hybrid (Granite-4.0-H-Small) maintained 875 tokens/sec

Why It Matters

For real-world RAG and coding agents, focus on prefill speed and KV cache efficiency, not just token generation benchmarks.
