Open Source

Custom C++ engine boosts MiniCPM-V 4.6 on Orange Pi to 5.9 tokens/s

r/LocalLLaMA May 25, 2026

⚡A developer bypassed heavy frameworks to double inference speed on a $149 NPU board.

Deep Dive

A developer has open-sourced a custom C++ inference engine that runs the multimodal MiniCPM-V 4.6 model entirely on the Orange Pi AIPro, a budget edge board powered by the Ascend 310B NPU ($149, 20 TOPS INT8). Frustrated by the overhead of the standard stack (torch_npu and the stock aclnnMm matmul operator), the developer wrote low-level AscendC kernels that drive the NPU's cube units directly. The stock baseline delivered just 2.88 tokens/s because vector-matrix multiplies (M=1, the shape of a single-token decode step) use the hardware poorly. The new engine achieves 5.90 tokens/s in FP16, roughly a 2x speedup, with the entire hot path kept in a single C++ subprocess. Python is used only on the cold path, for tokenization and image preprocessing.
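To make the throughput figures easier to compare with the per-step savings quoted in this article, the short sketch below converts tokens/s into average milliseconds per decode step. It uses only the two throughput numbers reported above; the function name `ms_per_token` is illustrative and does not come from the project.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert decode throughput into average latency per step, in milliseconds."""
    return 1000.0 / tokens_per_second


# Throughput figures reported in the article.
BASELINE_TOK_S = 2.88  # stock path
ENGINE_TOK_S = 5.90    # custom C++ engine, FP16

baseline_ms = ms_per_token(BASELINE_TOK_S)
engine_ms = ms_per_token(ENGINE_TOK_S)
speedup = ENGINE_TOK_S / BASELINE_TOK_S

print(f"baseline: {baseline_ms:.1f} ms per step")
print(f"engine:   {engine_ms:.1f} ms per step")
print(f"speedup:  {speedup:.2f}x")
```

In these terms the engine removes roughly 178 ms from each decode step, which matches the sum of the three reported savings (121 + 29 + 30 = 180 ms) to within rounding.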

The speedup came from three key optimizations. First, a custom cube matmul kernel for M=1 bypassed the slow generic vector path, saving 121 ms per step and pushing throughput to 4.37 tokens/s. Second, the lm_head (vocabulary size ~248k) was chunked into 16 cube-friendly slices at load time, with sequential matmuls and a host-side reduce, cutting another 29 ms to hit 4.99 tokens/s. Third, a vectorized causal-conv1d step kernel using Unified Buffer DMAs replaced a scalar baseline, saving 30 ms and reaching 5.90 tokens/s. The developer identifies the board's 44 GB/s memory bandwidth as the current bottleneck: each step reads 1.4 GB of FP16 weights and takes about 170 ms, against a theoretical floor of 32 ms at full bandwidth. The developer plans to implement fused INT4/INT8 dequantization kernels to push past 12 tokens/s.
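The chunked lm_head idea can be illustrated on the host side with a minimal pure-Python sketch. It is not the project's code: the article does not say what the "host reduce" computes, so the sketch assumes greedy decoding, where each slice yields a local best logit and the host keeps the global maximum. The helper names (`split_rows`, `matvec`, `chunked_argmax`) and the toy sizes are hypothetical; the real engine runs each slice as a cube matmul on the NPU.

```python
import random


def split_rows(weight, num_chunks):
    """Split a [vocab, hidden] matrix into contiguous row slices (done once at load time).

    Returns (row_offset, rows) pairs; ceiling division yields at most num_chunks slices.
    """
    vocab = len(weight)
    size = -(-vocab // num_chunks)  # ceiling division
    return [(start, weight[start:start + size]) for start in range(0, vocab, size)]


def matvec(rows, x):
    """Plain matrix-vector product, standing in for one M=1 matmul on the device."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def chunked_argmax(chunks, x):
    """Run the slices sequentially, then reduce on the host to one global token id."""
    best_id, best_logit = -1, float("-inf")
    for offset, rows in chunks:
        logits = matvec(rows, x)
        local = max(range(len(logits)), key=logits.__getitem__)
        if logits[local] > best_logit:
            best_id, best_logit = offset + local, logits[local]
    return best_id, best_logit


# Toy sizes for illustration only; the article cites a ~248k vocabulary and 16 slices.
random.seed(0)
VOCAB, HIDDEN, NUM_CHUNKS = 1000, 8, 16
weight = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
hidden = [random.uniform(-1, 1) for _ in range(HIDDEN)]

chunks = split_rows(weight, NUM_CHUNKS)
token_id, logit = chunked_argmax(chunks, hidden)

# Reference: the same argmax computed over the unsplit matrix.
full_logits = matvec(weight, hidden)
reference_id = max(range(VOCAB), key=full_logits.__getitem__)
print(token_id == reference_id)
```

The point of the sketch is that slicing by vocabulary rows does not change the result: each slice is an independent matmul, so the split can be chosen purely to suit the hardware.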

Key Points

Custom AscendC kernels double throughput from 2.88 to 5.90 tokens/s on the Orange Pi AIPro (Ascend 310B, $149).
Optimizations include a custom cube matmul for M=1, chunked lm_head for 248k vocabulary, and vectorized causal-conv1d.
Current bottleneck is memory bandwidth (44 GB/s); next step is fused INT4/INT8 dequantization to exceed 12 tokens/s.
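The bandwidth argument behind the last point can be checked with a few lines of arithmetic. The sketch below uses the 44 GB/s figure and the 1.4 GB of FP16 weights cited in the Deep Dive, and assumes (this is an assumption, not a claim from the post) that quantized weights shrink in proportion to bit width, ignoring scales, zero points, and any layers left in FP16. The resulting numbers are upper bounds on throughput, not predictions.

```python
BANDWIDTH_GB_S = 44.0   # memory bandwidth cited in the article
FP16_WEIGHTS_GB = 1.4   # weights read per decode step, per the article


def floor_ms(weight_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Minimum time to stream the weights once at full bandwidth, in milliseconds."""
    return weight_gb / bandwidth_gb_s * 1000.0


# Assumption: weight size scales linearly with bits per weight.
for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    weights_gb = FP16_WEIGHTS_GB * bits / 16
    ms = floor_ms(weights_gb)
    print(f"{name}: {weights_gb:.2f} GB -> floor {ms:.1f} ms, ceiling {1000.0 / ms:.1f} tokens/s")

# The stated goal of 12 tokens/s corresponds to this per-step budget.
target_ms = 1000.0 / 12
print(f"12 tokens/s budget: {target_ms:.1f} ms per step")
```

Under these assumptions the FP16 floor is about 32 ms per step, consistent with the figure quoted above, and the 12 tokens/s target (about 83 ms per step) sits well above the floor for any of the three formats.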

Why It Matters

Shows that low-cost edge hardware can efficiently run multimodal AI with custom low-level optimization, challenging reliance on heavy frameworks.
