llama.cpp b9263 merges HunyuanOCR into HunyuanVL for improved vision precision
Fixes OCR vision precision by folding HunyuanOCR into the HunyuanVL architecture.
Release b9263 of llama.cpp, the popular local LLM inference engine, merges HunyuanOCR directly into HunyuanVL. HunyuanOCR already shared HunyuanVL's Hugging Face architecture and vision layout, but it was implemented as a separate code path that omitted the +0.1 bilinear sampler used by the original reference implementation. That omission reduced OCR precision in vision tasks. With OCR collapsed into the HUNYUANVL projector and the HUNYUAN_VL text architecture, the bilinear sampler is now applied consistently, bringing output quality in line with the upstream model.
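The release notes do not spell out what the +0.1 term does. One plausible reading, assumed here and not confirmed by the source, is the small offset that some vision-transformer reference code adds to the target grid size before bilinearly resizing a grid. The pure-Python sketch below is not llama.cpp code: `bilinear_1d`, the grid values, and the sizes are all hypothetical. It only illustrates that a sampler using a scale of `(target + 0.1) / source` reads from slightly different source coordinates than one using `target / source`, which is the kind of mismatch that can degrade precision.

```python
import math

def bilinear_1d(values, out_len, scale):
    """Linearly resample `values` to `out_len` points (half-pixel centers).

    Hypothetical helper for illustration only; not taken from llama.cpp.
    """
    n = len(values)
    out = []
    for i in range(out_len):
        # Map the output sample center back into source coordinates.
        src = (i + 0.5) / scale - 0.5
        # Clamp so that edge samples reuse the border value.
        src = min(max(src, 0.0), float(n - 1))
        lo = int(math.floor(src))
        hi = min(lo + 1, n - 1)
        t = src - lo
        out.append(values[lo] * (1.0 - t) + values[hi] * t)
    return out

# Values equal to their own index, so each output is the sampled coordinate.
source_grid = [0.0, 1.0, 2.0, 3.0]
target = 6

# Assumed behaviour of a code path without the offset.
plain = bilinear_1d(source_grid, target, target / len(source_grid))
# Assumed behaviour of a code path with the +0.1 offset.
offset = bilinear_1d(source_grid, target, (target + 0.1) / len(source_grid))

for p, o in zip(plain, offset):
    print(f"{p:.4f}  {o:.4f}")
```

The two columns agree at the clamped edges and drift apart in the interior. Under the stated assumption, a model trained with one variant and run with the other would see slightly shifted inputs.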
Like other llama.cpp releases, b9263 ships as pre-built binaries for macOS (Apple Silicon with optional KleidiAI acceleration, Intel x64, iOS XCFramework), Linux (x64/ARM/s390x with Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16 backends), Windows (x64/ARM64 with CPU, CUDA 12/13, Vulkan, SYCL, and HIP), and Android ARM64. openEuler builds are also available for enterprise users. Because OCR no longer has its own code path, there is less code to maintain, and developers running vision-language models on local hardware get the corrected OCR behavior without extra configuration.
- HunyuanOCR merged into HunyuanVL, fixing the missing +0.1 bilinear sampler for improved OCR precision
- Unified under the HUNYUANVL projector and HUNYUAN_VL text architecture, eliminating the separate OCR code path
- Available across 20+ platform builds including macOS ARM64/Intel, Windows CUDA, Linux Vulkan/ROCm, Android ARM64
Why It Matters
Local vision-language models now deliver more accurate OCR, crucial for document processing and multimodal RAG pipelines.