LocalAI 4.6.0 boosts AMD ROCm GPU support and chat forking
AMD ROCm backends now run at full speed on-GPU, no more CPU fallback
LocalAI 4.6.0 brings a set of AMD ROCm reliability fixes. Audio backends such as rocm-qwen3-tts-cpp now compile with `-DGGML_HIP=ON` and link HIP for genuine GPU offload, instead of silently falling back to CPU. The hipBLASLt TensileLibrary data is bundled and the `HIPBLASLT_TENSILE_LIBPATH` environment variable is set, so hipBLASLt no longer drops to slow generic kernels. The rocm-vllm backend now installs the correct wheel from the AMD index on Python 3.12, and the ASIC ID table (`amdgpu.ids`) is symlinked so the system can find it. For distributed setups, a dead worker can no longer pin the per-model advisory lock in PostgreSQL: bounded load ceilings and a context-scoped `lock_timeout` remove the roughly 15-minute wedge. Orphaned backend workers now self-terminate when their parent dies, preventing VRAM leaks.
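The self-termination behavior can be illustrated with a small parent-death watchdog. This is a minimal sketch of one common technique (polling for a changed parent PID on a POSIX system), not LocalAI's actual implementation; the function names `wait_for_parent_exit` and `run_watchdog` and the one-second poll interval are assumptions made for illustration.

```python
import os
import time
from typing import Callable


def wait_for_parent_exit(
    original_ppid: int,
    get_ppid: Callable[[], int] = os.getppid,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until the parent PID differs from original_ppid.

    On POSIX systems an orphaned process is re-parented (typically to
    PID 1 or a subreaper), so a changed parent PID signals parent death.
    Returns the number of polls that still saw the original parent.
    """
    polls = 0
    while get_ppid() == original_ppid:
        polls += 1
        sleep(poll_seconds)
    return polls


def run_watchdog() -> None:
    """Hypothetical entry point a worker might run in a daemon thread."""
    parent = os.getppid()  # remember who started this worker
    wait_for_parent_exit(parent)
    # Exit immediately so the OS can reclaim the process's resources,
    # including any GPU memory it still holds.
    os._exit(1)
```

A worker would start `run_watchdog` in a background thread at launch. The `get_ppid` and `sleep` parameters exist only so the loop can be exercised without real processes or delays.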
The built-in chat UI gains conversation forking: users can regenerate any assistant answer (not just the last one), branch a new chat from any turn, duplicate a chat, or copy the entire conversation as Markdown, all client-side in the React UI. Realtime sessions now warm the full pipeline (VAD, ASR, LLM, TTS) eagerly at session start, eliminating per-model cold-start stalls on the first turn. A new `POST /backend/load` API and a "Load into memory" UI button let admins pre-warm models. For observability, PII detections, masks, and blocks are exported as a Prometheus counter (`localai_pii_events_total`), which makes it possible to alert when the filter stops firing. A gallery SSRF fix validates config-URL fetches against private, loopback, link-local, and cloud-metadata addresses. Additional improvements include idempotent backend installs (existing backends are no longer re-pulled unless forced), tool-calling and reasoning fixes for the vLLM and MLX backends, and cloud-proxy compatibility with the newest reasoning models.
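The address classes named in the SSRF fix can be checked with Python's standard `ipaddress` module. This is a minimal sketch of that kind of validation, not the code LocalAI ships; the helper name `is_blocked_address` and the choice of `169.254.169.254` as the cloud-metadata address are assumptions made for illustration.

```python
import ipaddress

# Assumption: the widely used IPv4 cloud-metadata endpoint. It is already
# link-local, but listing it makes the intent explicit.
METADATA_ADDRESSES = {ipaddress.ip_address("169.254.169.254")}


def is_blocked_address(host_ip: str) -> bool:
    """Return True if host_ip falls in a range an SSRF filter should refuse."""
    ip = ipaddress.ip_address(host_ip)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so it cannot be
    # used to sidestep the IPv4 checks.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip in METADATA_ADDRESSES
    )
```

In practice the check has to run on every address a hostname resolves to, and the fetch should connect to the address that was validated; otherwise a DNS record that changes between the check and the request could bypass the filter.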
- AMD ROCm backends now use real GPU offload with ggml HIP and bundled hipBLASLt TensileLibrary data
- Distributed model loading no longer hangs on a dead worker thanks to a lock timeout, and orphaned backend workers self-terminate to prevent VRAM leaks
- Chat UI adds forking, retry, branching, duplication, and copy-as-Markdown for collaborative workflows
Why It Matters
LocalAI 4.6.0 makes AMD GPU acceleration reliable and adds collaborative chat features for enterprise workflows.