Developer Tools

llama.cpp b9493 lets you skip ViT for faster text-only inference

llama.cpp Releases June 03, 2026

⚡New release adds option to bypass vision encoder, saving memory and compute.

Deep Dive

llama.cpp, the popular open-source C++ inference engine for large language models, has released build b9493. The headline feature is a new `skip_build_vit()` option for mtmd, llama.cpp's multimodal support library, that lets users skip building the Vision Transformer (ViT) component entirely. The ViT accounts for a significant share of memory and computation in multimodal models, so skipping it for pure text generation can cut memory usage by a reported 2-4 GB and shorten initialization. This is especially valuable in resource-constrained environments such as edge devices or shared servers. The commit (a731805) also includes assorted model nitpicks. The release covers all major platforms: macOS (Apple Silicon, Intel, iOS), Linux (x64, arm64, s390x, with Vulkan, ROCm 7.2, OpenVINO, SYCL FP32), Android arm64, and Windows (x64 CPU/arm64 CPU, CUDA 12/13, Vulkan, HIP). The release is signed with GPG key B5690EEEBB952194, so its integrity can be verified.
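To make the idea concrete, here is a minimal Python sketch of the general pattern: an optional vision encoder that is simply never constructed when a skip flag is set. It is not llama.cpp's API. Every name in it (`VisionEncoder`, `MultimodalContext`, `load_context`, `skip_vit`) is hypothetical, the parameter counts are made-up illustrations, and the estimate covers weights only, ignoring activations and compute buffers.

```python
from dataclasses import dataclass
from typing import Optional

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"f32": 4, "f16": 2}


@dataclass
class VisionEncoder:
    """Hypothetical stand-in for a ViT; holds only what is needed to size it."""
    n_params: int
    dtype: str = "f16"

    def weight_bytes(self) -> int:
        return self.n_params * BYTES_PER_PARAM[self.dtype]


@dataclass
class MultimodalContext:
    """Hypothetical context: a text model plus an optional vision encoder."""
    text_params: int
    vision: Optional[VisionEncoder]

    def supports_images(self) -> bool:
        return self.vision is not None


def load_context(text_params: int, vit_params: int,
                 skip_vit: bool = False, dtype: str = "f16") -> MultimodalContext:
    # When skip_vit is set, the encoder is never built, so it costs nothing.
    vision = None if skip_vit else VisionEncoder(vit_params, dtype)
    return MultimodalContext(text_params, vision)


def vision_weight_gib(ctx: MultimodalContext) -> float:
    """Weight memory held by the vision encoder, in GiB (0.0 if skipped)."""
    if ctx.vision is None:
        return 0.0
    return ctx.vision.weight_bytes() / 2**30


if __name__ == "__main__":
    # Illustrative sizes only: a 7B text model with a 1B-parameter encoder.
    full = load_context(7_000_000_000, 1_000_000_000)
    text_only = load_context(7_000_000_000, 1_000_000_000, skip_vit=True)
    print(f"with ViT:    {vision_weight_gib(full):.2f} GiB of encoder weights")
    print(f"without ViT: {vision_weight_gib(text_only):.2f} GiB of encoder weights")
```

The trade-off is visible in `supports_images()`: a context loaded with the encoder skipped cannot process image input, so the option only makes sense for sessions known to be text-only.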

Key Points

New skip_build_vit() option disables Vision Transformer initialization for mtmd models, saving memory and startup time.
Reduces memory footprint by 2-4 GB for multimodal LLMs when running text-only inference.
Supports all major platforms: macOS, Linux, Windows, Android, with GPU backends like CUDA, Vulkan, ROCm, and OpenVINO.

Why It Matters

Professionals running local LLMs can now save resources by skipping vision components when they only need text generation.
