b8372
The latest update enables 4-bit floating point (NVFP4) inference for Nvidia's Nemotron-H models, cutting memory use and boosting speed on supported GPUs.
The llama.cpp project, a cornerstone of the open-source ecosystem for running AI models locally, has released update b8372. The release focuses on expanding hardware and model compatibility, most notably by wiring up tensor support for Nvidia's Nemotron-H models and implementing the NVFP4 (4-bit floating point) data format. Together, these changes let supported Nvidia GPUs execute these models in a more compact, faster 4-bit precision, a key step for performance optimization.
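For context on what the new format stores: NVFP4 represents each weight as a four-bit E2M1 value (one sign bit, two exponent bits, one mantissa bit), with small groups of elements sharing a scale factor. The following self-contained C++ sketch decodes a single E2M1 nibble into a float to show the sixteen representable values; it illustrates the number format only, not llama.cpp's actual kernel code, and the function name is hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Decode one E2M1 nibble (1 sign, 2 exponent, 1 mantissa bit) to float.
// Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
// Hypothetical helper for illustration; real NVFP4 kernels operate on
// packed blocks that share a scale factor, not on single values.
float decode_e2m1(uint8_t nibble) {
    const int sign = (nibble >> 3) & 1;
    const int exp  = (nibble >> 1) & 3;
    const int man  = nibble & 1;

    float value;
    if (exp == 0) {
        // Subnormal range: 0 or 0.5
        value = man * 0.5f;
    } else {
        // Normal range: (1 + mantissa/2) * 2^(exp - 1)
        value = (1.0f + man * 0.5f) * float(1 << (exp - 1));
    }
    return sign ? -value : value;
}

int main() {
    // Print the full 16-entry code book.
    for (uint8_t n = 0; n < 16; ++n) {
        printf("nibble %2u -> %g\n", n, decode_e2m1(n));
    }
    return 0;
}
```

In Nvidia's published NVFP4 layout, a block of 16 such values shares an FP8 (E4M3) scale, so a dequantized weight is the decoded nibble multiplied by its block's scale.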
The update also broadens where these models can practically be deployed. With explicit Nemotron-H support, developers and researchers can run this model family directly within the highly optimized llama.cpp framework. Furthermore, the release includes updated builds across a wide array of platforms, providing ready-to-use binaries for macOS on both Apple Silicon and Intel, various Linux configurations (including CPU, Vulkan, ROCm 7.2, and OpenVINO), multiple Windows backends (CUDA 12/13, Vulkan, SYCL), and specialized builds for openEuler on Ascend hardware. This cross-platform effort lowers the barrier to running state-of-the-art models efficiently on everything from consumer laptops to specialized servers.
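For developers who link against llama.cpp rather than use the prebuilt binaries, loading a converted model with GPU offload follows the library's usual pattern. A minimal sketch, assuming the current C API names (llama_backend_init, llama_model_load_from_file) and a placeholder path to a locally converted Nemotron-H GGUF:

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    // Initialize the compiled-in backends (CUDA, Vulkan, etc.).
    llama_backend_init();

    // Request full GPU offload; layers that do not fit or are not
    // supported by the backend fall back toward CPU execution.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;

    // Placeholder path: a locally converted Nemotron-H GGUF file.
    llama_model * model = llama_model_load_from_file("nemotron-h.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context, tokenize, and decode as usual ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```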
For users, this means faster inference times and lower memory overhead when running Nemotron models locally, unlocking more complex AI tasks on personal hardware. The commitment to such wide platform support reinforces llama.cpp's role as a universal tool for democratizing high-performance AI inference, making cutting-edge model experimentation more accessible than ever before.
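As a back-of-envelope illustration of the memory claim: assuming NVFP4's published layout of one 8-bit scale per 16 four-bit elements, the effective width is 4 + 8/16 = 4.5 bits per weight, versus 16 bits for FP16. A hypothetical 9-billion-parameter model would therefore shrink from roughly 18 GB of weights in FP16 to around 5 GB, a difference that can decide whether the model fits on a consumer GPU at all.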
- Adds tensor support for Nvidia's Nemotron-H family of AI models within the framework.
- Implements NVFP4 (4-bit floating point) data format for faster, more memory-efficient inference on Nvidia GPUs.
- Provides pre-built binaries for over a dozen platform configurations, including macOS, Windows, Linux, and openEuler on Ascend hardware.
Why It Matters
Enables significantly faster and more efficient local execution of advanced Nvidia models, broadening access to high-performance AI.