Research & Papers

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

New system recovers up to ~83% of prompt tokens beyond exact-prefix caching on agentic traffic, slashing TTFT spikes.

Deep Dive

Irminsul targets a critical performance problem in agentic LLM serving: when agents reuse tokens across turns, standard prefix caches fail because the same tokens appear at different positions. This can cause TTFT spikes of 10-16 seconds even on unchanged content. Prior position-independent caching required correcting RoPE on the full key dimension, an expensive workaround for GQA models. Irminsul instead exploits the natural structure of Multi-Head Latent Attention (MLA, used in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3), where each KV row separates into a position-free latent c_KV and a 64-dim rotation-correctable k_r. This allows content-addressed caching to replace prefix matching as a first-class primitive.
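The δ-rotation idea rests on the fact that RoPE rotations compose additively: a k_r cached at one absolute position can be re-based to a new position by rotating it through the offset δ, while the latent c_KV needs no correction at all. Below is a minimal numpy sketch of that identity; the rope_rotate helper, pair layout, and positions are illustrative assumptions, not Irminsul's actual implementation.

    import numpy as np

    def rope_rotate(x, pos, base=10000.0):
        # Apply RoPE to one even-length vector: each pair (x[2i], x[2i+1])
        # is rotated by the angle pos * base**(-2i/d).
        d = x.shape[-1]
        inv_freq = base ** (-np.arange(0, d, 2) / d)
        cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)
        out = np.empty_like(x)
        out[0::2] = x[0::2] * cos - x[1::2] * sin
        out[1::2] = x[0::2] * sin + x[1::2] * cos
        return out

    rng = np.random.default_rng(0)
    k_r = rng.standard_normal(64)       # decoupled 64-dim RoPE key of one token
    p_old, p_new = 120, 3500            # same content, different absolute offsets
    delta = p_new - p_old

    cached = rope_rotate(k_r, p_old)             # what sits in the KV cache
    corrected = rope_rotate(cached, delta)       # δ-rotation applied on reuse
    recomputed = rope_rotate(k_r, p_new)         # fresh prefill at the new position

    assert np.allclose(corrected, recomputed)    # R(p_old + δ) == R(δ) · R(p_old)

Because only the 64-dim k_r slice needs this correction and c_KV is reused verbatim, the fix-up is far cheaper than re-rotating every key head, which is what makes MLA a natural fit for position-independent caching.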

Evaluated on three real MLA-MoE deployments (DeepSeek-V2-Lite, Kimi Moonlight-16B-A3B, and JoyAI-Flash), Irminsul recovers up to ~83% of prompt tokens beyond exact-prefix matching on agentic traffic while ensuring output consistency. It also delivers 63% prefill energy savings per cache hit. The system extends SGLang's radix cache with content-hash keying over content-defined-chunked (CDC) segments and a δ-rotation rule for k_r. These results argue that content-addressed caching should be a native serving primitive, not a retrofit over prefix matching, making agentic AI workloads faster and more cost-effective.
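Content-defined chunking keeps chunk boundaries stable under insertions and deletions, so a tool result or system prompt that shifts position between turns still hashes to the same cache keys. The toy sketch below illustrates the idea; the Gear-style rolling hash, chunk-size parameters, and SHA-256 keying are assumptions for illustration and may differ from the paper's actual chunker.

    import hashlib

    def cdc_chunks(token_ids, min_len=64, avg_len=256, max_len=1024):
        # Content-defined chunking over token ids: a boundary is declared when
        # the low bits of a rolling hash hit zero, so identical content yields
        # identical chunks no matter where it sits in the prompt.
        mask = avg_len - 1                       # avg_len must be a power of two
        chunks, start, h = [], 0, 0
        for i, tok in enumerate(token_ids):
            h = ((h << 1) + tok * 2654435761) & 0xFFFFFFFF
            size = i - start + 1
            if size >= max_len or (size >= min_len and (h & mask) == 0):
                chunks.append(token_ids[start:i + 1])
                start, h = i + 1, 0
        if start < len(token_ids):
            chunks.append(token_ids[start:])
        return chunks

    def chunk_key(chunk):
        # Content hash used as the cache key; position never enters the key.
        raw = b"".join(t.to_bytes(4, "little") for t in chunk)
        return hashlib.sha256(raw).hexdigest()

Each chunk key then indexes the stored (c_KV, k_r) rows, which is what lets the cache hit content that exact-prefix matching would miss.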

Key Points
  • Recovers up to ~83% of prompt tokens on agentic traffic above exact-prefix caching.
  • Delivers 63% prefill energy savings per cache hit on MLA models like DeepSeek-V2/V3.
  • Uses content-hash keying over CDC-chunked segments and a δ-rotation rule for position-free caching (see the combined sketch below).
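Putting the two pieces together, the hit path looks roughly like the sketch below: look up each chunk by its content hash, reuse c_KV verbatim on a hit, and δ-rotate the stored k_r to the chunk's new absolute position. It reuses the rope_rotate helper from the first sketch; the class and field names are hypothetical, not SGLang's or Irminsul's actual data structures.

    class ContentAddressedKVCache:
        # Toy store: content hash -> (c_kv, k_r cached with RoPE applied, position).
        def __init__(self):
            self.store = {}

        def insert(self, key, c_kv, k_r_rotated, pos):
            self.store[key] = (c_kv, k_r_rotated, pos)

        def lookup(self, key, new_pos):
            hit = self.store.get(key)
            if hit is None:
                return None                      # miss: prefill, then insert()
            c_kv, k_r_cached, old_pos = hit
            delta = new_pos - old_pos
            # c_kv is position-free; only the 64-dim k_r slice is corrected.
            return c_kv, rope_rotate(k_r_cached, delta)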

Why It Matters

Fixes a major bottleneck for agentic AI workflows, enabling faster and cheaper LLM serving at scale.