NanoCP boosts MoE serving efficiency by up to 3x

arXiv cs.DC May 21, 2026

⚡ NanoCP cuts P99 tail latency for MoE serving by up to 2.12x while sustaining up to 3.27x higher request rates

Deep Dive

Researchers from the Chinese University of Hong Kong (CUHK) and collaborators have introduced NanoCP, a system for serving Mixture-of-Experts (MoE) models that targets inefficiencies in existing hybrid data-expert parallelism setups. Traditional MoE serving binds each request’s attention, MoE communication, and KV cache to a single instance. That binding creates an imbalance: long-context requests strain the KV cache on one instance, while large batch sizes bottleneck MoE communication on another. The resulting expert-parallel (EP) stragglers and fragmented KV memory inflate tail latency.

NanoCP addresses this with request-level dynamic context parallelism (DCP), which decouples MoE communication from KV cache placement and assigns each request a context-parallel degree proportional to its KV footprint. Short requests are processed locally, while long requests distribute attention across multiple instances, effectively 'liquefying' the KV cache across the cluster. Paired with an ahead-of-time (AOT) graph engine and a custom routing-based communication backend, NanoCP sustains 1.88x to 3.27x higher request rates under strict time-per-output-token (TPOT) SLOs and reduces P99 tail latency by up to 2.12x.
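To make the idea of a per-request context-parallel degree concrete, here is a minimal Python sketch. It is an illustration only, not NanoCP's scheduler: the `shard_tokens` threshold, the ceiling-based degree rule, and the greedy least-loaded placement are all assumptions chosen for simplicity, and the names `Request`, `cp_degree`, and `place` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    kv_tokens: int  # current KV cache footprint, in tokens


def cp_degree(kv_tokens: int, shard_tokens: int, num_instances: int) -> int:
    """Assumed rule: degree grows with KV footprint, clamped to [1, num_instances]."""
    return max(1, min(num_instances, math.ceil(kv_tokens / shard_tokens)))


def place(requests, num_instances: int, shard_tokens: int):
    """Greedy sketch: split each request's KV evenly over its least-loaded instances."""
    load = [0] * num_instances  # KV tokens held by each instance
    plan = {}
    # Place the largest requests first so small ones can fill the gaps.
    for req in sorted(requests, key=lambda r: r.kv_tokens, reverse=True):
        degree = cp_degree(req.kv_tokens, shard_tokens, num_instances)
        targets = sorted(range(num_instances), key=lambda i: (load[i], i))[:degree]
        base, extra = divmod(req.kv_tokens, degree)
        for rank, inst in enumerate(targets):
            load[inst] += base + (1 if rank < extra else 0)
        plan[req.req_id] = sorted(targets)
    return plan, load


requests = [
    Request("long", 64_000),
    Request("a", 2_000),
    Request("b", 3_000),
    Request("c", 1_000),
]

# Dynamic degrees: the long request is spread over all four instances.
plan, load = place(requests, num_instances=4, shard_tokens=16_000)
print(plan)  # {'long': [0, 1, 2, 3], 'b': [0], 'a': [1], 'c': [2]}
print(load)  # [19000, 18000, 17000, 16000]

# Static binding (every request pinned to one instance), emulated with a huge threshold.
_, static_load = place(requests, num_instances=4, shard_tokens=10**9)
print(static_load)  # [64000, 3000, 2000, 1000]
```

In this toy setup the most heavily loaded instance holds 19,000 KV tokens with per-request degrees, against 64,000 when every request is pinned to a single instance. That skew is what the article ties to EP stragglers and fragmented KV memory.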

Key Points

NanoCP achieves 1.88x–3.27x higher request rates under strict TPOT SLOs by dynamically balancing KV cache and MoE communication loads
Reduces P99 tail latency by up to 2.12x by preventing EP stragglers and KV memory fragmentation across instances
Pairs a custom AOT graph engine with routing-based communication to bridge dynamic and static execution

Why It Matters

Cuts P99 tail latency by up to 2.12x while sustaining up to 3.27x higher request rates, which is critical for scaling next-gen LLMs in production.
