Developer Tools

llama.cpp b9488 adds Qwen3 SSM support, expanding model compatibility

The popular local LLM runner now supports Qwen3's state space models, unlocking new architectures.

Deep Dive

llama.cpp, the widely used C/C++ inference engine for large language models, has published release b9488, which adds support for Qwen3's SSM (state space model) architectures. The update introduces the LLM_KV_ATTENTION_RECURRENT_LAYERS configuration key and passes tests for Qwen3 SSM variants. The project, which has over 114k stars and 19.1k forks on GitHub, continues to expand its model compatibility beyond traditional transformers, letting users run emerging efficient architectures locally.

The release is available across all major platforms, including macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (CPU, CUDA 12/13, Vulkan, HIP), Android (arm64), and iOS. This broad support means developers and researchers can experiment with Qwen3's state space models on consumer hardware without cloud dependencies. By integrating SSM architectures, llama.cpp positions itself as a versatile tool for the next wave of efficient AI models. State space models can potentially reduce memory and computation requirements relative to transformers while maintaining performance, because they carry a fixed-size recurrent state instead of a cache that grows with every token.
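To make the memory argument concrete, here is a minimal Python sketch of a toy linear state space recurrence next to a count of the values a standard attention key/value cache would hold. It is illustrative only: the function names, the scalar state, and the parameters a, b, and c are assumptions made for this example, and none of it reflects llama.cpp's implementation or Qwen3's actual layer definition.

```python
# Illustrative only: a toy state space recurrence, not llama.cpp code
# and not Qwen3's real layer. All names here are hypothetical.

def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over scalar inputs."""
    h = 0.0  # the layer's whole memory: one fixed-size state
    ys = []
    for x in xs:
        h = a * h + b * x  # update the state in place; no per-token history is kept
        ys.append(c * h)
    return ys, h


def kv_cache_entries(seq_len, n_layers, n_kv_heads, head_dim):
    """Values held by a standard attention cache: keys and values for every token."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim


if __name__ == "__main__":
    ys, final_state = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
    print(ys)  # [2.0, 1.0, 0.5]
    # The attention cache grows linearly with context length...
    print(kv_cache_entries(1024, 2, 4, 8), kv_cache_entries(2048, 2, 4, 8))
    # ...while the recurrent state above stays a single value however long xs is.
```

In the toy model the state is one number regardless of sequence length, whereas the cache count doubles when the context doubles. Real models use vector or matrix states and may mix recurrent and attention layers, so actual savings depend on the architecture.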

Key Points
  • Adds support for Qwen3 SSM (state space model) architectures, an alternative to transformers.
  • Introduces the LLM_KV_ATTENTION_RECURRENT_LAYERS configuration key, whose name indicates that it describes a model's recurrent layers.
  • Available on 10+ build targets across macOS, Linux, Windows, Android, and iOS, with multiple GPU backends on Windows (CUDA, Vulkan, HIP).

Why It Matters

The release lets users run Qwen3's SSM models locally on personal devices, without depending on cloud services, which broadens access to newer, more efficient model architectures.