LocalAI 4.6.0 boosts AMD ROCm GPU support and chat forking

LocalAI July 05, 2026

⚡AMD ROCm backends now run at full speed on-GPU, no more CPU fallback

Deep Dive

LocalAI 4.6.0 brings a set of AMD ROCm reliability fixes. Audio backends such as rocm-qwen3-tts-cpp now compile with `-DGGML_HIP=ON` and link against HIP, so inference is genuinely offloaded to the GPU instead of silently falling back to CPU. The hipBLASLt TensileLibrary data is now bundled and the `HIPBLASLT_TENSILE_LIBPATH` environment variable is set to point at it, which eliminates the slow generic kernels used when that data is missing. The rocm-vllm backend now installs the correct wheel from the AMD index on Python 3.12, and the ASIC ID table (`amdgpu.ids`) is symlinked so that it can be found at runtime.

For distributed setups, a dead worker can no longer pin the per-model advisory lock in PostgreSQL. The roughly 15-minute wedge this used to cause is gone, thanks to bounded load ceilings and a context-scoped `lock_timeout`. Orphaned backend workers now terminate themselves when their parent process dies, which prevents VRAM leaks.
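To make the self-termination idea concrete, here is a minimal Python sketch of one common way a worker can notice that its parent has died: it remembers the parent PID at startup and polls for a change. This is an illustration under stated assumptions, not LocalAI's actual implementation. The function names `parent_has_changed` and `watch_parent` are hypothetical, and the approach relies on POSIX re-parenting behavior, where an orphaned process gets a new parent PID.

```python
import os
import threading


def parent_has_changed(original_ppid, getppid=os.getppid):
    """Return True once the process has been re-parented, i.e. its parent died."""
    return getppid() != original_ppid


def watch_parent(original_ppid, on_orphaned, interval=1.0,
                 getppid=os.getppid, stop=None):
    """Poll the parent PID and call on_orphaned() once when it changes.

    Returns True if the callback fired, False if `stop` was set first.
    All names here are hypothetical and exist only for illustration.
    """
    if stop is None:
        stop = threading.Event()
    while not stop.is_set():
        if parent_has_changed(original_ppid, getppid):
            # A real worker would release GPU memory and exit at this point.
            on_orphaned()
            return True
        # Sleep for the interval, waking early if a stop is requested.
        stop.wait(interval)
    return False
```

A worker could record `os.getppid()` at startup and run `watch_parent` in a daemon thread with a callback that shuts the process down. Polling is simple and portable across POSIX systems, at the cost of reacting up to one interval late.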

The built-in chat UI gains conversation forking: users can regenerate any assistant answer (not just the last one), branch a new chat from any turn, duplicate a chat, or copy the entire conversation as Markdown. All of this happens client-side in the React UI. Realtime sessions now eagerly warm the full pipeline (VAD, ASR, LLM, TTS) up front, eliminating per-model cold-start stalls on the first turn. A new `POST /backend/load` API and a "Load into memory" UI button let admins pre-warm models.

For observability, PII detections, masks, and blocks are exported as a Prometheus counter (`localai_pii_events_total`), which makes it possible to alert when the filter stops firing. A gallery SSRF fix validates config-URL fetches against private, loopback, link-local, and cloud-metadata addresses. Additional improvements include idempotent backend installs (existing backends are no longer re-pulled unless forced), tool-calling and reasoning fixes for the vLLM and MLX backends, and cloud-proxy compatibility with the newest reasoning models.
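The kind of address validation behind an SSRF fix like this can be sketched with Python's standard `ipaddress` module. This is a hedged illustration of the general technique, not LocalAI's code. The helper names `is_blocked_address` and `url_is_safe` are hypothetical, and the exact set of blocked ranges is an assumption based on the categories listed above. The common cloud-metadata address `169.254.169.254` falls inside the link-local range.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(addr):
    """Return True if addr is private, loopback, link-local, or otherwise non-public."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return True  # Unparseable addresses are rejected rather than trusted.
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_unspecified or ip.is_reserved or ip.is_multicast)


def url_is_safe(url, resolve=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    # Each getaddrinfo entry ends with a sockaddr tuple whose first item is the IP.
    addrs = {info[4][0] for info in infos}
    return bool(addrs) and not any(is_blocked_address(a) for a in addrs)
```

A check like this only covers the addresses seen at validation time. A robust fetcher would also connect to the address it validated, rather than resolving the hostname a second time, so that a DNS answer cannot change between the check and the request.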

Key Points

AMD ROCm backends now use real GPU offload with ggml HIP and bundled hipBLASLt TensileLibrary data
Distributed model loading no longer hangs on dead worker; lock timeout and self-termination prevent VRAM leaks
Chat UI adds forking, retry, branching, duplication, and copy-as-Markdown for collaborative workflows

Why It Matters

LocalAI 4.6.0 makes AMD GPU acceleration reliable and adds collaborative chat features for enterprise workflows
