Developer Tools

llama.cpp b9411 adds DeepSeek V3.2 with sparse attention

Open-source inference engine now supports DeepSeek's 685B model with DSA and NVFP4.

Deep Dive

The latest release of llama.cpp (tag b9411) from ggml-org introduces native support for DeepSeek V3.2, the 685B-parameter sparse mixture-of-experts model. The update includes a generic implementation of DeepSeek Sparse Attention (DSA), in which a lightweight "lightning indexer" scores earlier tokens and attention is then computed over only the highest-scoring ones rather than the full context. The release also adds support for NVFP4, NVIDIA's 4-bit floating-point quantization format, enabling lower-bit inference on compatible hardware, and refactors memory handling so that the KV cache and the DSA cache are kept separate, a change intended to improve performance.
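To make the indexer-then-attend idea concrete, here is a minimal pure-Python sketch for a single query position. It is an illustration under stated assumptions, not llama.cpp's code: the function names (indexer_scores, top_k_indices, sparse_attention) are hypothetical, and the scoring rule (a weighted sum of ReLU-clipped dot products across indexer heads) follows DeepSeek's published description of the lightning indexer only loosely.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def indexer_scores(index_queries, head_weights, index_keys):
    """Score every cached token for one query position.

    Assumed rule: score[s] = sum_j w_j * relu(q_j . k_s), where j runs
    over the (hypothetical) indexer heads.
    """
    return [
        sum(w * max(0.0, dot(q, k)) for q, w in zip(index_queries, head_weights))
        for k in index_keys
    ]


def top_k_indices(scores, k):
    """Positions of the k highest-scoring tokens, returned in position order."""
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    dim = len(values[0])
    return [
        sum(e / total * values[i][d] for e, i in zip(exps, selected))
        for d in range(dim)
    ]


# Toy data: four cached tokens, one indexer head, keep the top two.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = indexer_scores([[1.0, 0.0]], [1.0], keys)
selected = top_k_indices(scores, k=2)
output = sparse_attention([1.0, 0.0], keys, values, selected)
```

The saving comes from the last step: the full attention computation touches only k tokens per query instead of the whole context, while the indexer pass that chooses them is meant to be much cheaper per token.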

With 114,000 stars and 18,900 forks on GitHub, llama.cpp remains one of the most widely used frameworks for running large language models locally. This update targets users who want to run DeepSeek V3.2 on Apple Silicon, Linux, or Windows machines (CPU, CUDA, Vulkan, ROCm). The DSA implementation reduces the attention cost of long-context inference, though it does not shrink the weights: even at roughly 4 bits per weight, a 685B-parameter model occupies hundreds of gigabytes, more than a single consumer GPU can hold. The release also includes various cleanup and optimization commits aimed at stability across macOS, iOS, Linux, and Windows builds.
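A back-of-the-envelope estimate shows what a model of this size demands at different precisions. The sketch below counts weight storage only (no KV cache, DSA cache, or activations) and assumes every one of the 685B parameters is stored at the stated width, which real model files generally do not do. The NVFP4 figure assumes the layout NVIDIA describes, a 4-bit value plus one 8-bit scale shared by each block of 16 weights; the helper name weight_bytes is hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


PARAMS = 685e9  # parameter count quoted for DeepSeek V3.2

# Assumed NVFP4 cost: 4 bits per value plus an 8-bit scale per 16-value block.
NVFP4_BITS = 4 + 8 / 16

for label, bits in [("16-bit", 16), ("8-bit", 8), ("NVFP4 (approx.)", NVFP4_BITS)]:
    gigabytes = weight_bytes(PARAMS, bits) / 1e9
    print(f"{label}: about {gigabytes:.0f} GB")
```

Under these assumptions the weights alone come to roughly 1,370 GB at 16 bits and about 385 GB in NVFP4, which is why 4-bit formats matter for a model this large and why the result still calls for a multi-GPU or large unified-memory system.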

Key Points
  • llama.cpp b9411 adds a generic DeepSeek Sparse Attention (DSA) implementation with a lightning indexer for efficient long-context inference.
  • New support for the DeepSeek V3.2 model family and for NVFP4 quantization, which reduces the memory footprint.
  • Memory refactoring separates the KV cache from the DSA cache, improving performance on Apple Silicon and CUDA GPUs.
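For readers unfamiliar with 4-bit floating point, the following sketch shows the general idea behind block-scaled FP4 quantization. It is a simplified illustration, not the NVFP4 code path in llama.cpp: the function names are hypothetical, the block scale is kept as an ordinary Python float (the real format stores it in 8 bits), and values are rounded to the nearest magnitude representable by a 4-bit E2M1 number.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block of weights to (scale, signed grid values).

    The scale maps the largest magnitude in the block onto 6.0, the
    largest E2M1 value. Simplification: the scale stays a full float.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0.0, [0.0] * len(block)
    scale = max_abs / E2M1_GRID[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from a scale and its grid values."""
    return [scale * c for c in codes]


scale, codes = quantize_block([12.0, -6.0, 1.0, 0.2])
restored = dequantize_block(scale, codes)
```

Each weight is reduced to one of sixteen signed values, so accuracy depends on how well a single scale fits all the weights in a block; small blocks keep that error contained at the cost of storing more scales.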

Why It Matters

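Brings DeepSeek V3.2's sparse attention and 4-bit NVFP4 quantization to a widely used open-source inference engine, lowering the cost of long-context inference outside hosted APIs, although a 685B-parameter model still requires hundreds of gigabytes of memory to run locally.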