llama.cpp b9049: MiniCPM-V 4.6 support
Run OpenBMB's latest vision-language model locally with improved flash attention support.
The latest release of ggml-org/llama.cpp, a popular C++ framework for running large language models locally, now supports MiniCPM-V 4.6, a state-of-the-art multimodal vision-language model from OpenBMB. Tagged b9049 and released on May 6, the update introduces a dedicated branch for MiniCPM-V 4.6 integration. Key technical changes include flash attention support via build_attn, slice alignment using n_merge, and reuse of wa_layer_indexes to locate ViT merger insertion points. The release also fixes bugs, updates the pre-commit hooks and conversion scripts, and cleans up tensor naming conventions, keeping the model compatible with the existing GGUF format.
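For readers who prefer to produce the GGUF files themselves rather than download a prebuilt quantization, the workflow typically looks like the sketch below. The convert_hf_to_gguf.py script is part of the llama.cpp repository, but the Hugging Face repo ID, output file names, and the --mmproj option shown here are assumptions for illustration and may differ for the exact MiniCPM-V 4.6 recipe.

```sh
# Sketch: convert a Hugging Face MiniCPM-V checkpoint to GGUF.
# Repo ID, output names, and flags are assumptions; check the
# llama.cpp multimodal docs for the exact MiniCPM-V 4.6 steps.

# 1. Download the checkpoint (hypothetical repo name).
huggingface-cli download openbmb/MiniCPM-V-4_6 --local-dir MiniCPM-V-4_6

# 2. Convert the language model weights to GGUF.
python convert_hf_to_gguf.py MiniCPM-V-4_6 \
    --outfile minicpm-v-4.6-f16.gguf --outtype f16

# 3. Export the vision projector (mmproj) used by the multimodal runtime
#    (recent llama.cpp versions expose this via --mmproj).
python convert_hf_to_gguf.py MiniCPM-V-4_6 \
    --mmproj --outfile mmproj-minicpm-v-4.6-f16.gguf
```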
This release provides pre-compiled binaries for a wide range of platforms: macOS (both Apple Silicon with optional KleidiAI and Intel x64), Linux (x64/arm64/s390x CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64/arm64 CPU, CUDA 12.4 & 13.1, Vulkan, SYCL, HIP), Android arm64, iOS XCFramework, and openEuler variants. Users can now run MiniCPM-V 4.6 locally for tasks like image captioning, visual question answering, and document understanding, benefiting from llama.cpp's efficient inference on CPU or GPU, with optional flash attention for faster processing.
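With the weights and a prebuilt binary in place, a local run for a task such as image captioning or document understanding might look like the following sketch. The llama-mtmd-cli binary and the flags shown reflect typical llama.cpp multimodal usage, but the file names are placeholders and flag spellings (for example --flash-attn versus plain -fa) should be checked against the b9049 build's --help output.

```sh
# Sketch: run MiniCPM-V 4.6 locally with the multimodal CLI.
# File names are placeholders; flag forms may vary between builds.
./llama-mtmd-cli \
    -m minicpm-v-4.6-f16.gguf \
    --mmproj mmproj-minicpm-v-4.6-f16.gguf \
    --image invoice.png \
    -p "Describe this document and extract the total amount." \
    --flash-attn on \
    -ngl 99   # offload layers to the GPU if one is available
```

Dropping -ngl falls back to CPU-only inference, which the prebuilt CPU binaries listed above support out of the box.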
- llama.cpp b9049 adds dedicated support for MiniCPM-V 4.6, a multimodal vision-language model from OpenBMB.
- Includes flash attention support and slice alignment for improved inference performance on local hardware.
- Pre-built binaries available across macOS, Linux, Windows, Android, and iOS for easy deployment.
Why It Matters
Enables on-device multimodal AI for vision-language tasks, reducing cloud dependency and latency.