Developer Tools

llama.cpp b9493 lets you skip the ViT for faster text-only inference

New release adds option to bypass vision encoder, saving memory and compute.

Deep Dive

llama.cpp, the popular open-source C++ inference engine for large language models, has released build b9493. The headline feature is a new `skip_build_vit()` option in mtmd, the project's multimodal support library, that lets users bypass the Vision Transformer (ViT) component entirely. The ViT accounts for a significant share of memory and computation in multimodal models, so skipping it for pure text generation can cut memory usage by an estimated 2-4 GB, depending on the model, and shorten initialization. This is especially valuable in resource-constrained environments such as edge devices or shared servers. The commit (a731805) also includes assorted model nitpicks. Prebuilt binaries cover all major platforms: Apple (macOS on Apple Silicon and Intel, iOS), Linux (x64, arm64, s390x, with Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32 builds), Android arm64, and Windows (x64 and arm64 CPU, CUDA 12/13, Vulkan, HIP). The release commit carries a verified signature (GPG key ID B5690EEEBB952194).
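
The general idea of skipping an optional component at build time can be sketched in a few lines. This is an illustration only, not llama.cpp or mtmd code: `VisionEncoder`, `MultimodalContext`, `build_context`, `encode_image`, and the `skip_vit` flag are hypothetical names, and the real `skip_build_vit()` option may behave differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisionEncoder:
    """Hypothetical stand-in for a ViT; a real one would hold large weight tensors."""
    n_weights: int


@dataclass
class MultimodalContext:
    """Hypothetical context: the text path is always present, vision is optional."""
    text_ready: bool
    vision: Optional[VisionEncoder]  # None when the encoder was skipped


def build_context(skip_vit: bool, vit_weights: int = 1_000) -> MultimodalContext:
    """Build the context, allocating the vision encoder only when it is wanted."""
    vision = None if skip_vit else VisionEncoder(n_weights=vit_weights)
    return MultimodalContext(text_ready=True, vision=vision)


def encode_image(ctx: MultimodalContext, pixels: list) -> int:
    """Fail loudly if image input arrives after the encoder was skipped."""
    if ctx.vision is None:
        raise RuntimeError("vision encoder was skipped; context is text-only")
    return len(pixels)  # placeholder for real encoding work


text_only = build_context(skip_vit=True)   # no encoder allocated
full = build_context(skip_vit=False)       # encoder allocated as usual
```

The trade-off in a design like this is that a text-only context has to reject image input explicitly rather than fail somewhere deeper in the pipeline.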

Key Points
  • New skip_build_vit() option in mtmd skips Vision Transformer initialization, saving memory and startup time.
  • Reduces memory footprint by an estimated 2-4 GB, depending on the model, when running text-only inference on multimodal LLMs.
  • Builds cover all major platforms: macOS, Linux, Windows, and Android, with backends including CUDA, Vulkan, ROCm, and OpenVINO.
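
As a rough guide to how encoder size translates into memory, weight storage is approximately parameter count times bytes per parameter. The sketch below only illustrates that arithmetic. The parameter counts are hypothetical and not taken from any specific model, and `weight_gib` is a helper written for this example.

```python
def weight_gib(n_params: int, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB: parameters x bytes per parameter."""
    return n_params * bytes_per_param / 2**30


# Hypothetical encoder sizes, chosen only to illustrate the arithmetic.
for n_params in (400_000_000, 1_000_000_000):
    for label, width in (("f16", 2), ("f32", 4)):
        gib = weight_gib(n_params, width)
        print(f"{n_params:>13,} params @ {label}: {gib:.2f} GiB")
```

Activations and compute buffers come on top of the weights, so actual savings depend on the model and the backend.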

Why It Matters

Professionals running local LLMs can now save resources by skipping vision components when they only need text generation.