llama.cpp gets Vulkan Tensor Parallel support from dev Piotr
A single pull request brings early multi-GPU LLM inference to llama.cpp's Vulkan backend.
Deep Dive
Piotr (pwilkin) has made Vulkan Tensor Parallel "somewhat usable", and the community is excited to see it evolve.
Key Points
- Pull request #25051 by pwilkin adds Vulkan Tensor Parallel to llama.cpp.
- Enables splitting LLM inference across multiple GPUs via the Vulkan API.
- Still early, described as 'somewhat usable', but it lays a foundation for multi-GPU support without CUDA.
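For readers new to the term, tensor parallelism generally means splitting a layer's weight matrix across devices so that each one computes a slice of the output. The snippet below is a minimal conceptual sketch in plain Python, not the code from the pull request: the function names are hypothetical, the "devices" are simulated by a loop, and column-wise splitting is only one common way to shard a matrix.

```python
# Conceptual sketch of tensor parallelism. This is NOT llama.cpp's Vulkan
# implementation; all names are hypothetical and devices are simulated.

def matvec(x, w):
    """Multiply row vector x (length k) by matrix w (k rows, n columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, parts):
    """Split w into `parts` column shards of near-equal width."""
    n = len(w[0])
    bounds = [n * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in w] for p in range(parts)]

def tensor_parallel_matvec(x, w, num_devices):
    """Compute x @ w by giving each simulated device one column shard."""
    shards = split_columns(w, num_devices)
    # In a real backend, each of these products would run on its own GPU.
    partials = [matvec(x, shard) for shard in shards]
    # Gather step: concatenate the per-device output slices in order.
    return [value for part in partials for value in part]

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

# The sharded result matches the single-device result.
assert tensor_parallel_matvec(x, w, 2) == matvec(x, w)
```

Each device only needs to hold its own shard of the weights, which is why this approach can let a model that exceeds one GPU's memory run across several.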
Why It Matters
Multi-GPU LLM inference on Vulkan opens local AI to GPUs from multiple vendors, reducing reliance on Nvidia's proprietary CUDA stack.