Research & Papers

Silent NPU fallback on Snapdragon: ONNX Runtime bug triples production latency

Eval latency looks fine, but production latency triples due to silent CPU fallback.

Deep Dive

A Reddit post highlights a recurring bug in ONNX Runtime's QNN execution provider (EP) for Qualcomm's Hexagon NPU on Snapdragon SoCs. The EP silently routes unsupported operations to the CPU, but the runtime only emits a startup-log line that most engineers ignore. This creates a bimodal latency distribution: on-device eval looks fine because the median falls on the fast cluster, but production input distributions stress fallback paths differently, tripling latency.

The post prescribes three CI gates to catch the issue. First, run on real hardware—emulators implement the ISA in software, so every op appears supported. Second, gate on coefficient of variation (CV) rather than median latency: healthy on-NPU CV is 2–5%, while intermittent fallback pushes it above 15%. Third, parse the ORT profiling JSON with `profiling_level=detailed`, which contains per-op routing information. The post includes Python code for a CV gating function and an ORT profile parser. The pattern also likely applies to TensorRT on Jetson and CoreML on iOS.
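The second gate can be sketched in a few lines. This is not the post's exact code, just a minimal illustration of the idea: CV is stddev divided by mean, so a tight on-NPU cluster passes while a bimodal fallback distribution fails, even if its median still looks healthy.

```python
import statistics

def cv_gate(latencies_ms, threshold=0.15):
    """CI gate on coefficient of variation (stddev / mean).

    A healthy on-NPU run sits around 2-5% CV; intermittent CPU
    fallback produces a bimodal distribution and pushes CV past ~15%.
    Returns (passed, cv). Hypothetical helper, not from the post.
    """
    mean = statistics.mean(latencies_ms)
    cv = statistics.stdev(latencies_ms) / mean
    return cv <= threshold, cv

# Tight cluster: all samples on the NPU fast path -> passes.
ok, cv_fast = cv_gate([10.1, 10.3, 9.9, 10.0, 10.2])
# Bimodal: two samples hit the ~3x CPU fallback path -> fails,
# even though the median is still ~10 ms.
bad, cv_bimodal = cv_gate([10.0, 10.1, 31.5, 10.2, 30.9])
```

Note that a median gate would pass both runs above; the CV gate is what separates them.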

Key Points
  • ONNX Runtime's QNN EP silently falls back to CPU for unsupported ops, causing latency to triple in production
  • Median latency gates fail because fallback creates a bimodal distribution; coefficient of variation >15% is a reliable indicator
  • Parsing ORT profiling JSON at detailed level identifies the specific op that fell back, enabling targeted fixes
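The third gate can be sketched as follows. This assumes ORT's chrome-trace-style profile output (a JSON array of events where per-node events carry the executing EP under `args.provider`); the exact field layout can vary by ORT version, so treat this as a starting point rather than the post's parser.

```python
import json
from collections import Counter

def ops_by_provider(profile_path):
    """Count per-node profile events by execution provider.

    Reads an ONNX Runtime profiling JSON (produced by enabling
    profiling on the session, with the QNN EP's detailed profiling
    level). Any entries attributed to the CPU provider reveal ops
    that silently fell back off the NPU. Sketch only; field names
    assume ORT's chrome-trace event format.
    """
    with open(profile_path) as f:
        events = json.load(f)
    counts = Counter()
    for ev in events:
        provider = ev.get("args", {}).get("provider")
        if provider:
            counts[provider] += 1
    return counts
```

In CI, the gate would fail whenever the returned counter contains a CPU-provider entry for a model that is supposed to run entirely on the NPU, and the offending op names in those same events give the targeted fix.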

Why It Matters

For teams deploying ML on Snapdragon, these CI gates catch silent fallback regressions that would otherwise triple production latency while passing every median-based eval.