llama.cpp b9459 adds f16 GLU kernels for faster Apple Silicon inference
New template-based GLU kernels save memory bandwidth by using native half-precision tensors.
llama.cpp, the popular C++ inference engine for LLMs, has shipped build b9459 with an optimization aimed at Apple Silicon users. The update replaces the hardcoded f32 GLU kernels with a single C++ template that supports both f16 (half) and f32 tensor types. Because the new kernels load and store data in the tensor's native format (half or float), f16 tensors move half as many bytes per element as they would after conversion to f32. That matters because memory bandwidth is a common bottleneck on Apple's unified-memory hardware. The arithmetic itself still runs in float to avoid precision issues in operations like geglu and swiglu.
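The idea of "native-type load/store, float math" can be illustrated with a small CPU-side sketch. This is not the Metal kernel from the release: `half_t` is a deliberately simplified stand-in for a hardware half type, and `swiglu_row` is a hypothetical name chosen for this example. The swiglu formula used here, silu(x) * gate, is the usual definition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified 16-bit float for illustration only: truncating conversion,
// subnormals flushed to zero, overflow and NaN mapped to infinity.
struct half_t {
    std::uint16_t bits = 0;

    half_t() = default;

    explicit half_t(float f) {
        std::uint32_t x;
        std::memcpy(&x, &f, sizeof x);
        const std::uint32_t sign = (x >> 16) & 0x8000u;
        const std::int32_t  exp  = static_cast<std::int32_t>((x >> 23) & 0xFFu) - 127 + 15;
        const std::uint32_t mant = (x >> 13) & 0x3FFu;
        if (exp <= 0)       bits = static_cast<std::uint16_t>(sign);
        else if (exp >= 31) bits = static_cast<std::uint16_t>(sign | 0x7C00u);
        else bits = static_cast<std::uint16_t>(sign | (static_cast<std::uint32_t>(exp) << 10) | mant);
    }

    explicit operator float() const {
        const std::uint32_t sign = (static_cast<std::uint32_t>(bits) & 0x8000u) << 16;
        const std::uint32_t exp  = (bits >> 10) & 0x1Fu;
        const std::uint32_t mant = bits & 0x3FFu;
        std::uint32_t x = sign;  // zero (subnormals are flushed)
        if (exp == 31)     x = sign | 0x7F800000u | (mant << 13);
        else if (exp != 0) x = sign | ((exp + 112u) << 23) | (mant << 13);
        float f;
        std::memcpy(&f, &x, sizeof f);
        return f;
    }
};

// One template covers both storage types: T is float or half_t.
// Values are loaded and stored as T, but all arithmetic is done in float.
template <typename T>
void swiglu_row(const T* x, const T* gate, T* dst, std::size_t n) {
    assert(n == 0 || (x != nullptr && gate != nullptr && dst != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        const float x0   = static_cast<float>(x[i]);     // load in native type, widen
        const float x1   = static_cast<float>(gate[i]);
        const float silu = x0 / (1.0f + std::exp(-x0));  // math stays in float
        dst[i] = static_cast<T>(silu * x1);              // narrow only on store
    }
}
```

Instantiating `swiglu_row<float>` and `swiglu_row<half_t>` yields the two variants from one source. The half version rounds only at the final store, so intermediate values such as the exponential keep float precision.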
This change also widens the dispatch gate to allow f16 inputs, so GLU operations on half-precision tensors no longer need to be upcast to f32 first. For developers running local LLMs (e.g., LLaMA, Mistral) on MacBooks, Mac Studios, or iOS devices, this should translate to faster token generation and lower power draw. The update is part of ongoing efforts to optimize Metal backend performance, keeping llama.cpp competitive with frameworks like MLX or Core ML for on-device AI inference.
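A dispatch gate of this kind is typically a small predicate that decides whether the backend accepts an operation for a given tensor type. The sketch below shows the before/after shape of such a check. Every name in it (`TensorType`, `glu_gate_before`, `glu_gate_after`, `select_glu_kernel`, and the kernel name strings) is hypothetical and is not taken from the llama.cpp source.

```cpp
#include <cassert>

// Hypothetical tensor types; a real backend distinguishes many more.
enum class TensorType { F32, F16, Q4_0 };

// Gate as it would look with f32-only kernels: everything else is rejected
// and has to be converted or handled elsewhere.
constexpr bool glu_gate_before(TensorType src) {
    return src == TensorType::F32;
}

// Widened gate: f16 inputs are accepted as well.
constexpr bool glu_gate_after(TensorType src) {
    return src == TensorType::F32 || src == TensorType::F16;
}

// Once the gate passes, the matching template instantiation is selected.
// The returned strings are illustrative labels, not real kernel names.
inline const char* select_glu_kernel(TensorType src) {
    assert(glu_gate_after(src));
    return src == TensorType::F16 ? "swiglu_f16" : "swiglu_f32";
}
```

The gate and the kernel selection have to agree: widening the gate is only safe once a kernel instantiation exists for each type it admits.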
- Introduced templated GLU kernels supporting both f16 and f32 tensor types, replacing hardcoded f32-only kernels.
- Load/store now uses native half or float format, saving memory bandwidth while keeping ALU math in float for accuracy.
- Dispatch gate opened for f16 inputs, enabling efficient half-precision inference on Apple Silicon (Metal).
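The bandwidth saving can be estimated with simple arithmetic. The sketch below assumes a GLU op that reads two input values and writes one output value per element, all in the same storage type, and it ignores caching effects. `glu_bytes_moved` is a hypothetical helper, and `std::uint16_t` stands in for a 2-byte half type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes read and written for n output elements stored as T:
// two loads (value and gate) plus one store per element.
template <typename T>
constexpr std::size_t glu_bytes_moved(std::size_t n) {
    return 3 * n * sizeof(T);
}

// Under these assumptions, 2-byte storage moves exactly half the bytes.
static_assert(glu_bytes_moved<float>(1024) == 2 * glu_bytes_moved<std::uint16_t>(1024),
              "f32 traffic is twice the f16 traffic");
```

For 1024 elements this gives 12288 bytes in f32 versus 6144 bytes in f16. Any extra conversion pass needed to upcast f16 data to f32 beforehand would add further traffic on top of the f32 figure.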
Why It Matters
Apple users get faster, more memory-efficient local LLM inference, which is crucial for running large models on unified memory hardware.