Developer Tools

llama.cpp b9291 boosts SYCL MoE prefill throughput with a counting-sort optimization

New release cuts the cost of arranging expert rows in the SYCL backend from O(n_as * n_routed_rows) to O(n_as + n_routed_rows), speeding up MoE prefill.

Deep Dive

The llama.cpp team has shipped b9291, a new build of its popular C/C++ LLM inference engine. The headline improvement targets SYCL backend performance for Mixture-of-Experts (MoE) models during the prefill phase. Previously, the k_copy_src1_to_contiguous function arranged expert rows with an operation that cost O(n_as * n_routed_rows). The new implementation switches to a counting-sort-based approach with O(n_as + n_routed_rows) complexity. It precomputes a contiguous mapping so that all rows 'owned' by an expert sit in a single slice with known start and end positions. Because the cost now grows with the sum of the two terms rather than their product, the saving is largest when the number of experts (n_as) is large.
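The grouping idea can be sketched in a few lines of Python. This is an illustration of counting sort applied to expert routing, not the actual SYCL kernel. The function names (group_rows_naive, group_rows_counting_sort) and the flat expert_ids list are assumptions made for the example, and the real code runs on the GPU rather than in a sequential loop.

```python
from typing import List, Tuple


def group_rows_naive(expert_ids: List[int], n_experts: int) -> List[List[int]]:
    """Baseline for comparison: rescan every routed row once per expert.

    Cost is O(n_experts * n_rows).
    """
    return [
        [row for row, e in enumerate(expert_ids) if e == expert]
        for expert in range(n_experts)
    ]


def group_rows_counting_sort(
    expert_ids: List[int], n_experts: int
) -> Tuple[List[int], List[int]]:
    """Group row indices by expert in O(n_experts + n_rows).

    Returns (offsets, row_map). The rows owned by expert e are
    row_map[offsets[e]:offsets[e + 1]], in their original order.
    """
    # Pass 1: count how many rows were routed to each expert.
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Pass 2: an exclusive prefix sum turns counts into slice boundaries.
    offsets = [0] * (n_experts + 1)
    for expert in range(n_experts):
        offsets[expert + 1] = offsets[expert] + counts[expert]

    # Pass 3: scatter each row index into the next free position of
    # its expert's slice.
    cursor = offsets[:-1]  # slicing copies, so offsets stays untouched
    row_map = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        row_map[cursor[e]] = row
        cursor[e] += 1

    return offsets, row_map
```

For example, calling group_rows_counting_sort([2, 0, 2, 1, 0, 2], 4) returns offsets [0, 2, 3, 6, 6] and row_map [1, 4, 3, 0, 2, 5]. Expert 2 therefore owns the slice row_map[3:6], which holds rows 0, 2 and 5, and expert 3 owns an empty slice. As an illustrative comparison with made-up sizes, 8 experts and 1,024 routed rows mean roughly 8 * 1,024 = 8,192 row checks for the baseline against about 8 + 1,024 = 1,032 steps per pass for the counting sort.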

The release also packages builds for multiple platforms: macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL FP32/FP16), Windows (x64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler. The MoE change itself lives in the SYCL backend, so the speedup applies to the SYCL builds, which typically run on Intel GPUs. The other builds ship as part of the same release but are not affected by this optimization. The algorithmic change is particularly beneficial for large MoE models like Mixtral, where prefill speed directly impacts prompt processing latency. With b9291, local LLM deployments that use the SYCL backend can process prompts faster without requiring hardware upgrades.

Key Points
  • Reduces the SYCL backend's cost of arranging expert rows during MoE prefill from O(n_as * n_routed_rows) to O(n_as + n_routed_rows) via counting sort
  • Uses precomputed contiguous mapping to group all rows owned by an expert into a single slice
  • Ships builds for macOS, Windows, Linux, Android, and openEuler covering CPU, Vulkan, ROCm, OpenVINO, CUDA, SYCL, and HIP backends

Why It Matters

Faster MoE prompt processing on hardware that runs the SYCL backend, such as Intel GPUs, reducing prompt latency for developers running LLMs locally.