NanoCP boosts MoE serving efficiency by up to 3x
NanoCP cuts P99 tail latency for MoE serving by up to 2.12x while sustaining up to 3.27x higher request rates...
Researchers from the Chinese University of Hong Kong (CUHK) and collaborators have introduced NanoCP, a breakthrough system for serving Mixture-of-Experts (MoE) models that addresses critical inefficiencies in existing hybrid data-expert parallelism setups. Traditional MoE serving binds each request’s attention, MoE communication, and KV cache to a single instance. That binding creates an imbalance: long-context requests strain the KV cache, while large batch sizes bottleneck MoE communication. The result is expert-parallelism (EP) stragglers and fragmented KV memory, both of which inflate tail latency.
NanoCP solves this with request-level dynamic context parallelism (DCP), which decouples MoE communication from KV cache placement and assigns each request a context-parallel degree proportional to its KV footprint. Short requests are processed locally, while long requests distribute attention across multiple instances, effectively 'liquefying' the KV cache across the cluster. Paired with an ahead-of-time (AOT) graph engine and a custom routing-based communication backend, NanoCP maintains up to 3.27x higher request rates under strict time-per-output-token (TPOT) SLOs and reduces P99 tail latency by up to 2.12x.
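To make the idea of a per-request context-parallel degree concrete, here is a minimal Python sketch. It is not NanoCP's actual scheduler: the function names, the `tokens_per_shard` threshold, and the least-loaded placement rule are all assumptions chosen for illustration. It only shows how a degree that grows with a request's KV footprint keeps short requests on one instance and spreads long ones across several.

```python
import math
from typing import Dict, List


def choose_cp_degree(kv_tokens: int, tokens_per_shard: int, max_degree: int) -> int:
    """Hypothetical policy: one shard per `tokens_per_shard` KV tokens,
    clamped to the range [1, max_degree]."""
    if kv_tokens < 0 or tokens_per_shard <= 0 or max_degree <= 0:
        raise ValueError("kv_tokens must be >= 0; the other arguments must be > 0")
    return max(1, min(max_degree, math.ceil(kv_tokens / tokens_per_shard)))


def place_request(kv_tokens: int, load: List[int], tokens_per_shard: int) -> Dict[int, int]:
    """Spread one request's KV tokens over the least-loaded instances.

    `load[i]` is the number of KV tokens instance i already holds; it is
    updated in place. Returns {instance index: KV tokens assigned}.
    """
    degree = choose_cp_degree(kv_tokens, tokens_per_shard, len(load))
    # Least-loaded instances first; the stable sort breaks ties by index.
    targets = sorted(range(len(load)), key=lambda i: load[i])[:degree]
    # Split as evenly as possible; the first `extra` targets get one more token.
    base, extra = divmod(kv_tokens, degree)
    placement: Dict[int, int] = {}
    for rank, inst in enumerate(targets):
        share = base + (1 if rank < extra else 0)
        load[inst] += share
        placement[inst] = share
    return placement


if __name__ == "__main__":
    load = [0, 0, 0, 0]                      # four serving instances, all empty
    print(place_request(1_000, load, 4096))  # short request: stays on one instance
    print(place_request(10_000, load, 4096)) # long request: sharded across several
    print(load)
```

In this toy policy the short request keeps its attention local, while the long request's KV cache is divided among the emptiest instances, which is the balancing effect the paragraph above describes. A real system would also have to account for the communication cost of distributed attention, which this sketch ignores.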
- NanoCP achieves 1.88x–3.27x higher request rates under strict TPOT SLOs by dynamically balancing KV cache and MoE communication loads
- Reduces P99 tail latency by up to 2.12x by preventing EP stragglers and KV memory fragmentation, spreading long requests' KV cache across instances
- Pairs a custom AOT graph engine with routing-based communication to bridge dynamic and static execution
Why It Matters
Cuts P99 tail latency for MoE serving by up to 2.12x while sustaining up to 3.27x higher request rates under strict TPOT SLOs, which is critical for scaling next-gen LLMs in production.