Open Source

Backend-agnostic tensor parallelism has been merged into llama.cpp

New experimental feature splits models across GPUs from any vendor, no CUDA required, for faster inference.

Deep Dive

The llama.cpp project, a leading C++ framework for running LLMs efficiently on consumer hardware, has merged a significant experimental feature: backend-agnostic tensor parallelism. This update changes how models are distributed across multiple graphics cards. Previously, efficiently utilizing more than one GPU often required a specific, vendor-locked software stack. The new `-sm tensor` command-line flag instead splits the model's individual weight tensors across all available GPUs, regardless of vendor, promising major speed increases for inference on large models like Meta's Llama 3.
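
To make the usage concrete, here is a minimal sketch of how the new mode might be invoked with `llama-cli`; the model filename and prompt are illustrative, and the `tensor` value for `-sm` is taken from the merge as described above.

```sh
# Sketch only: the model path and prompt are illustrative.
#   -ngl 99     offload all layers to the GPUs
#   -sm tensor  the new experimental tensor-parallel split mode
./llama-cli -m models/llama-3-70b-instruct.Q4_K_M.gguf \
    -ngl 99 -sm tensor \
    -p "Explain tensor parallelism in one paragraph."
```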

This backend-agnostic design is the key breakthrough: the feature works not only with NVIDIA's dominant CUDA platform but also with alternatives like Vulkan, Metal on Apple Silicon, and the other backends llama.cpp supports. That opens high-performance multi-GPU setups to users with AMD cards or integrated graphics. The developers caution that this is an experimental merge; performance gains are not guaranteed and can vary dramatically with model architecture. Users are advised to benchmark the new `-sm tensor` mode against the default `-sm layer` (layer-wise parallelism) to see which works best for their hardware and model combination.
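
As a hedged sketch of that comparison: `llama-bench` accepts comma-separated values for most parameters and reports prompt-processing and generation throughput for each run, so both split modes can be measured side by side. Whether it accepts the new `tensor` value depends on the build.

```sh
# Benchmark both split modes on the same model and compare tokens/s.
# llama-bench expands comma-separated values into one run per value.
./llama-bench -m models/llama-3-70b-instruct.Q4_K_M.gguf \
    -ngl 99 -sm layer,tensor
```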

Key Points
  • Enables tensor parallelism across multiple GPUs via new `-sm tensor` flag, moving beyond layer-wise (`-sm layer`) splitting.
  • Backend-agnostic design means it works with Vulkan and Metal, not just NVIDIA's CUDA, opening it up to AMD and Apple users.
  • Merged as an experimental feature with warnings that performance is model-dependent and results may be poor without tuning.

Why It Matters

Lowers the barrier for high-speed, multi-GPU AI inference, making powerful model deployment more accessible across hardware ecosystems.