FlexNPU virtualizes Ascend NPUs, cutting LLM first-token latency by over 92%
Transparent virtualization cuts LLM time-to-first-token by over 92% on Huawei's hardware
FlexNPU is a transparent user-space virtualization layer for Huawei Ascend NPUs that interposes on AscendCL APIs and routes operations through per-device daemons. This decouples unmodified applications from physical NPUs without altering model code, AI frameworks, or drivers. The key innovation is dynamic prefill-decode (PD) co-location, which adapts scheduling between the compute-heavy prefill phase and the memory-bandwidth-bound decode phase, exploiting their complementary resource characteristics.
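The co-location idea can be illustrated with a toy scheduler. This is only a minimal sketch under stated assumptions, not FlexNPU's actual implementation: the `Op` and `PhaseAwareScheduler` names, the normalized compute and bandwidth budgets, and the greedy packing policy are all hypothetical, and no real AscendCL calls are involved. It shows how a per-device daemon might pack compute-heavy prefill work and bandwidth-heavy decode work into the same scheduling step so that neither resource sits idle.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Op:
    """A hypothetical intercepted operation, tagged with its serving phase."""
    request_id: str
    phase: str        # "prefill" or "decode"
    compute: float    # share of device compute needed (hypothetical unit, 0..1)
    bandwidth: float  # share of memory bandwidth needed (hypothetical unit, 0..1)


class PhaseAwareScheduler:
    """Toy per-device scheduler that co-locates prefill and decode work."""

    def __init__(self, compute_budget=1.0, bandwidth_budget=1.0):
        self.compute_budget = compute_budget
        self.bandwidth_budget = bandwidth_budget
        self.queues = {"prefill": deque(), "decode": deque()}

    def submit(self, op):
        self.queues[op.phase].append(op)

    def next_batch(self):
        """Pick ops for one step without exceeding either resource budget."""
        batch, compute_used, bandwidth_used = [], 0.0, 0.0
        progressed = True
        while progressed:
            progressed = False
            # Alternate phases so bandwidth-bound and compute-bound ops interleave.
            for phase in ("decode", "prefill"):
                queue = self.queues[phase]
                if not queue:
                    continue
                op = queue[0]
                fits_compute = compute_used + op.compute <= self.compute_budget + 1e-9
                fits_bandwidth = bandwidth_used + op.bandwidth <= self.bandwidth_budget + 1e-9
                if fits_compute and fits_bandwidth:
                    batch.append(queue.popleft())
                    compute_used += op.compute
                    bandwidth_used += op.bandwidth
                    progressed = True
        return batch


# Usage: two prefill ops and two decode ops with made-up resource shares.
sched = PhaseAwareScheduler()
sched.submit(Op("a", "prefill", compute=0.7, bandwidth=0.2))
sched.submit(Op("b", "prefill", compute=0.7, bandwidth=0.2))
sched.submit(Op("c", "decode", compute=0.1, bandwidth=0.6))
sched.submit(Op("d", "decode", compute=0.1, bandwidth=0.6))

first = [op.request_id for op in sched.next_batch()]   # ['c', 'a']
second = [op.request_id for op in sched.next_batch()]  # ['d', 'b']
```

Each step pairs one decode op with one prefill op because two ops of the same phase would exceed a single budget, while a mixed pair fits both. A static disaggregated setup would instead pin each phase to its own devices, leaving the complementary resource underused.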
On a 384-card Ascend 910C cluster running DeepSeek-R1, FlexNPU boosted throughput by 5.15% and 26.33% over static PD disaggregation, while on Qwen2.5-7B it reduced time-to-first-token (TTFT) by over 92% with nearly unchanged time per output token (TPOT). The system introduces no measurable inference overhead compared to direct NPU passthrough and even slightly improves throughput in some scenarios. By providing fine-grained runtime control over operator dispatch and phase-aware scheduling, FlexNPU demonstrates that transparent NPU virtualization is a practical foundation for efficient and responsive LLM serving.
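To put the percentages in perspective, the short calculation below converts a latency reduction into a speedup factor and applies a throughput gain to a baseline. The helper names and the 1,000 tokens/s baseline are hypothetical and chosen only for illustration; only the 92% and 26.33% figures come from the reported results.

```python
def speedup_from_reduction(reduction_pct):
    """Convert a percentage latency reduction into a speedup factor."""
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


def throughput_after_gain(baseline, gain_pct):
    """Apply a percentage throughput gain to a baseline value."""
    return baseline * (1.0 + gain_pct / 100.0)


# A 92% TTFT reduction leaves 8% of the original latency, a 12.5x speedup.
ttft_speedup = round(speedup_from_reduction(92.0), 1)

# Hypothetical baseline of 1,000 tokens/s with the reported 26.33% gain.
improved = round(throughput_after_gain(1000.0, 26.33), 1)
```

A reduction of "over 92%" therefore means the first token arrives at least about 12.5 times sooner, which is a much larger change than "92% faster" might suggest.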
- FlexNPU virtualizes Ascend NPUs transparently without modifying AI frameworks or model code
- On 384-card Ascend 910C with DeepSeek-R1, throughput improved up to 26.33% over static PD disaggregation
- On Qwen2.5-7B, time-to-first-token (TTFT) reduced by over 92% while sustaining nearly identical decode latency
Why It Matters
Enables dynamic resource sharing for LLM serving, cutting latency and boosting throughput without changes to model code, AI frameworks, or drivers.