llama.cpp b9291 boosts MoE prefill throughput with algorithmic optimizations

llama.cpp Releases May 23, 2026

⚡ New release cuts the cost of a SYCL MoE prefill step from O(n_as * n_routed_rows) to O(n_as + n_routed_rows) for faster prompt processing.

Deep Dive

The llama.cpp team has shipped b9291, a new build of its popular C/C++ LLM inference engine. The headline change targets SYCL backend performance for Mixture-of-Experts (MoE) models during the prefill phase. Previously, the k_copy_src1_to_contiguous function did O(n_as * n_routed_rows) work to arrange expert rows. The new implementation switches to a counting-sort approach with O(n_as + n_routed_rows) complexity and precomputes a contiguous mapping, so all rows 'owned' by an expert sit in a single slice with known start and end positions. The saving grows with the number of experts (n_as), because the old cost scaled with the product of the two counts while the new cost scales with their sum.
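To make the counting-sort idea concrete, here is a minimal host-side C++ sketch. It is not the SYCL kernel from the release: the names ExpertGrouping, group_rows_by_expert, expert_of_row, start, and row_mapping are hypothetical and chosen only for illustration. It assumes each routed row carries one expert id in the range [0, n_as).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for illustration; not taken from llama.cpp.
struct ExpertGrouping {
    // Rows owned by expert e occupy positions [start[e], start[e + 1]).
    std::vector<int32_t> start;       // size n_as + 1
    // row_mapping[pos] is the index of the routed row placed at position pos.
    std::vector<int32_t> row_mapping; // size n_routed_rows
};

// expert_of_row[i] is the expert id (assumed 0 <= id < n_as) for routed row i.
ExpertGrouping group_rows_by_expert(const std::vector<int32_t> & expert_of_row,
                                    int32_t n_as) {
    ExpertGrouping g;
    g.start.assign(static_cast<size_t>(n_as) + 1, 0);
    g.row_mapping.resize(expert_of_row.size());

    // Pass 1: count rows per expert, O(n_routed_rows).
    for (int32_t e : expert_of_row) {
        g.start[e + 1]++;
    }

    // Pass 2: prefix sum turns counts into slice boundaries, O(n_as).
    for (int32_t e = 0; e < n_as; ++e) {
        g.start[e + 1] += g.start[e];
    }

    // Pass 3: scatter each row into its expert's slice, O(n_routed_rows).
    // cursor[e] tracks the next free position inside expert e's slice.
    std::vector<int32_t> cursor(g.start.begin(), g.start.end() - 1);
    for (size_t i = 0; i < expert_of_row.size(); ++i) {
        g.row_mapping[cursor[expert_of_row[i]]++] = static_cast<int32_t>(i);
    }
    return g;
}
```

With the mapping built once, a copy step can walk one expert's slice directly instead of scanning every routed row for every expert, which is where the sum-instead-of-product cost comes from. The scatter is stable, so rows keep their original relative order within each expert's slice.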

The release also packages builds for multiple platforms: macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL FP32/FP16), Windows (x64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler. The release itself therefore runs on everything from Intel GPUs to AMD GPUs and Apple Silicon, although the MoE prefill optimization described here targets the SYCL backend, so the speedup should show up in the SYCL builds. The algorithmic change is most relevant for large MoE models such as Mixtral, where prefill speed directly affects prompt-processing latency. With b9291, local LLM deployments on SYCL-capable hardware can gain prefill throughput without requiring hardware upgrades.

Key Points

Reduces MoE prefill complexity from O(n_as * n_routed_rows) to O(n_as + n_routed_rows) via counting sort
Uses precomputed contiguous mapping to group all rows owned by an expert into a single slice
Supports builds for macOS, Windows, Linux, Android, and openEuler on CPU, Vulkan, ROCm, CUDA, SYCL, and HIP backends

Why It Matters

Faster local MoE prompt processing on the SYCL backend, shipped alongside prebuilt packages for a wide range of platforms, reducing prompt latency for developers running LLMs on diverse hardware.
