FlexNPU virtualizes Ascend NPUs for 92% faster LLM prefill

arXiv cs.DC June 04, 2026

⚡Transparent virtualization chops LLM time-to-first-token by 92% on Huawei's hardware

Deep Dive

FlexNPU is a transparent user-space virtualization layer for Huawei Ascend NPUs that interposes on AscendCL APIs and routes operations through per-device daemons. This decouples unmodified applications from physical NPUs without altering model code, AI frameworks, or drivers. The key innovation is dynamic prefill-decode (PD) co-location, which adapts scheduling between the compute-heavy prefill phase and the memory-bandwidth-bound decode phase, exploiting their complementary resource characteristics.
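The idea of co-locating phases with complementary resource needs can be illustrated with a toy admission loop. This is a minimal sketch under stated assumptions, not FlexNPU's scheduler: the `Task` type, the `pick_batch` function, and the fractional compute and bandwidth figures are all hypothetical and chosen only to show why a compute-heavy prefill and bandwidth-bound decodes can share one device when two prefills cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical unit of work; the resource fractions are illustrative only."""
    name: str
    phase: str        # "prefill" (compute-heavy) or "decode" (bandwidth-bound)
    compute: float    # assumed share of device compute the task needs
    bandwidth: float  # assumed share of device memory bandwidth the task needs


def pick_batch(queue, compute_budget=1.0, bandwidth_budget=1.0):
    """Greedily admit tasks in queue order while both budgets still hold."""
    eps = 1e-9  # tolerance for floating-point accumulation
    admitted, deferred = [], []
    used_compute = used_bandwidth = 0.0
    for task in queue:
        fits_compute = used_compute + task.compute <= compute_budget + eps
        fits_bandwidth = used_bandwidth + task.bandwidth <= bandwidth_budget + eps
        if fits_compute and fits_bandwidth:
            admitted.append(task)
            used_compute += task.compute
            used_bandwidth += task.bandwidth
        else:
            deferred.append(task)
    return admitted, deferred


queue = [
    Task("prefill-A", "prefill", compute=0.7, bandwidth=0.2),
    Task("prefill-B", "prefill", compute=0.6, bandwidth=0.2),
    Task("decode-C", "decode", compute=0.1, bandwidth=0.5),
    Task("decode-D", "decode", compute=0.1, bandwidth=0.3),
]
admitted, deferred = pick_batch(queue)
print("co-located:", [t.name for t in admitted])
print("deferred:  ", [t.name for t in deferred])
```

In this toy setup the second prefill is deferred because it would exceed the compute budget, while both decode tasks fit alongside the first prefill because they mostly consume bandwidth. A real system would make this decision dynamically from runtime measurements rather than from fixed fractions.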

In a 384-card Ascend 910C cluster running DeepSeek-R1, FlexNPU improved throughput over static PD disaggregation by 5.15% and 26.33% in the two reported settings. On Qwen2.5-7B, it reduced time-to-first-token (TTFT) by over 92% with nearly unchanged decode latency (time per output token). The system introduces no measurable inference overhead compared to direct NPU passthrough and even slightly improves throughput in some scenarios. By providing fine-grained runtime control over operator dispatch and phase-aware scheduling, FlexNPU demonstrates that transparent NPU virtualization is a practical foundation for efficient and responsive LLM serving.
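To make the percentages above concrete, the short sketch below converts a percentage reduction or gain into absolute values and speedup factors. The baseline numbers (1000 ms TTFT, 100 requests per second) are hypothetical placeholders, not measurements from the paper, and the helper names are illustrative.

```python
def after_reduction(baseline, percent):
    """Value remaining after reducing `baseline` by `percent` percent."""
    return baseline * (1 - percent / 100)


def after_gain(baseline, percent):
    """Value after increasing `baseline` by `percent` percent."""
    return baseline * (1 + percent / 100)


def speedup_from_reduction(percent):
    """Speedup factor implied by a `percent` reduction in latency."""
    return 1 / (1 - percent / 100)


# Hypothetical baselines, chosen only to illustrate the arithmetic.
baseline_ttft_ms = 1000.0
baseline_throughput = 100.0

ttft_ms = after_reduction(baseline_ttft_ms, 92)
ttft_speedup = speedup_from_reduction(92)
throughput = after_gain(baseline_throughput, 26.33)

print(f"TTFT: {baseline_ttft_ms:.0f} ms -> {ttft_ms:.0f} ms ({ttft_speedup:.1f}x faster)")
print(f"Throughput: {baseline_throughput:.0f} -> {throughput:.2f} requests/s")
```

A 92% reduction in TTFT leaves 8% of the original latency, which corresponds to a speedup of about 12.5x. A 26.33% throughput gain, by contrast, is a factor of about 1.26x, so the two headline figures are not directly comparable.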

Key Points

FlexNPU virtualizes Ascend NPUs transparently without modifying AI frameworks or model code
On 384-card Ascend 910C with DeepSeek-R1, throughput improved up to 26.33% over static PD disaggregation
On Qwen2.5-7B, time-to-first-token (TTFT) reduced by over 92% while sustaining nearly identical decode latency

Why It Matters

Enables dynamic resource sharing for LLM serving, cutting latency and boosting throughput without changes to model code, AI frameworks, or drivers.
