AI benchmark reveals prefill, not token generation, takes 94-99% of agentic workload time
KV head count beats parameter count for long-context agentic tasks
A comprehensive long-context benchmark compared 13 LLMs spanning dense, MoE, Mamba2 hybrid, and MLA architectures at context sizes from 512 to 131K tokens. Run on an RX 7900 XT with llama.cpp and three KV cache tiers, the test found that prefill (prompt processing) accounts for 94-99% of total agentic task time when output is short (~300 tokens). In that regime, token generation speed (tg128) is nearly irrelevant.
The fastest prefill speeds at 131K context came from Trinity-Mini (923 tokens/sec), Granite-4.0-H-Small (875), and Ornith-9B (873). The study found that KV head count matters more than total parameters: models with 8 KV heads and a head dimension of 128 (like Trinity) scaled better than larger dense models. Notably, GLM-4.7-Flash (MLA) crashed above 16K, and Devstral-24B couldn't complete the 131K run due to KV cache memory limits. The takeaway: for real-world agentic RAG and coding agents, optimize for prefill speed and KV memory, not raw parameters or generation speed.
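To see why KV cache memory can stop a run at 131K, here is a minimal sketch of the usual size estimate for standard attention layers. The 8 KV heads and head dimension of 128 come from the text above; the layer count of 40 is a hypothetical value chosen for illustration, not a figure from the benchmark, and the quantized tier names are an assumption since the benchmark's three KV cache tiers are not named here. The formula does not apply to Mamba2 state-space layers or MLA's compressed cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2.0):
    """Estimate KV cache size for standard (MHA/GQA) attention layers.

    Each layer stores one key and one value vector of size
    n_kv_heads * head_dim for every token in the context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


# Bytes per element for common llama.cpp cache types.
# Assumption: the benchmark's three tiers are not named in the article.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus a 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus a 2-byte scale per block
}

N_LAYERS = 40        # hypothetical layer count, for illustration only
N_KV_HEADS = 8       # from the article
HEAD_DIM = 128       # from the article
CONTEXT = 131_072    # assuming "131K" means 2**17 tokens

for name, width in BYTES_PER_ELEM.items():
    size_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, width) / 1024**3
    print(f"{name}: {size_gib:.1f} GiB")
```

Under these assumed values the f16 cache alone would need 20 GiB before any model weights are loaded, which illustrates how a mid-sized dense model can run out of memory at the longest context.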
- Prefill (prompt processing) takes 94-99% of wall-clock time at 65K+ context, making tg128 (token generation) nearly irrelevant for short agentic outputs
- Trinity-Mini (MoE 3B/26B) led prefill at 131K context with 923 tokens/sec, beating larger models thanks to its 8-KV-head architecture
- KV cache size and head count matter more than parameter count: models with fewer KV heads (e.g., 4) saw severe slowdowns at high context, while Mamba2 hybrid (Granite-4.0-H-Small) maintained 875 tokens/sec
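The arithmetic behind the prefill share can be sketched as follows. The 923 tokens/sec prefill speed and the ~300-token output come from the benchmark figures above; the generation speed of 50 tokens/sec is an assumed value for illustration only, and "131K" is taken to mean 131,072 tokens.

```python
def prefill_share(context_tokens, prefill_tps, output_tokens, gen_tps):
    """Return the fraction of wall-clock time spent processing the prompt."""
    prefill_seconds = context_tokens / prefill_tps
    gen_seconds = output_tokens / gen_tps
    return prefill_seconds / (prefill_seconds + gen_seconds)


share = prefill_share(
    context_tokens=131_072,  # assuming "131K" means 2**17 tokens
    prefill_tps=923.0,       # fastest 131K prefill speed reported above
    output_tokens=300,       # short agentic output, as in the article
    gen_tps=50.0,            # assumed generation speed, not a reported result
)
print(f"prefill share of total time: {share:.1%}")
```

With these inputs the prompt takes well over two minutes to process while generation takes a few seconds, so even doubling the generation speed barely moves the total.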
Why It Matters
For real-world RAG and coding agents, focus on prefill speed and KV cache efficiency, not just token generation benchmarks.