llama.cpp b9747 adds real-time model load progress tracking via SSE

llama.cpp Releases June 22, 2026

⚡ Track LLM loading progress live with the new /models/sse endpoint in the llama.cpp server.

Deep Dive

The open-source llama.cpp project has released version b9747, bringing a significant quality-of-life improvement to its HTTP server mode. The headline feature is real-time model load progress tracking via a new /models/sse endpoint using Server-Sent Events (SSE). This allows clients to subscribe to progress updates while a large language model is being loaded into memory, providing visibility into a previously opaque process.
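To make the subscription model concrete, here is a minimal sketch of a client that reads an SSE stream. The parser follows the standard event-stream framing (`event:` and `data:` fields, blank line as terminator). The host, port, and the helper names `parse_sse` and `watch_model_load` are assumptions for illustration, and the payload format that /models/sse actually emits is not described in the release summary, so the sketch prints the data verbatim instead of interpreting it.

```python
from typing import Iterable, Iterator
from urllib.request import Request, urlopen


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event, using the standard event-stream framing."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def watch_model_load(base_url: str = "http://localhost:8080") -> None:
    """Print each event from /models/sse. The base URL is an assumption."""
    req = Request(base_url + "/models/sse", headers={"Accept": "text/event-stream"})
    with urlopen(req) as resp:
        # Iterating an HTTP response yields one bytes line at a time.
        text_lines = (chunk.decode("utf-8") for chunk in resp)
        for ev in parse_sse(text_lines):
            print(ev["event"], ev["data"])
```

Calling `watch_model_load()` against a running server would print events as they arrive; a UI could feed the same events into a progress bar once the payload fields are known from the server documentation.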

Under the hood, the implementation adds a mutex to make notify_to_router thread-safe, and the documentation has been updated. The feature is particularly useful for developers running self-hosted inference servers or embedding llama.cpp into larger applications, because it lets them drive progress bars, status indicators, or logs during model initialization.
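The thread-safety point can be illustrated with a small sketch of a lock-guarded notifier. This is not llama.cpp's code (the project is written in C++ and the internals of notify_to_router are not shown in the release summary); the class `ProgressNotifier` and its methods are hypothetical and only demonstrate the general pattern of protecting shared notification state with a mutex.

```python
import threading
from typing import Callable, List


class ProgressNotifier:
    """Hypothetical stand-in for a notification path shared by several threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        # Mutating the shared list happens only while holding the lock.
        with self._lock:
            self._subscribers.append(callback)

    def notify(self, message: str) -> None:
        # Copy the list under the lock, then call the subscribers outside it
        # so a slow subscriber does not block other threads from subscribing.
        with self._lock:
            targets = list(self._subscribers)
        for callback in targets:
            callback(message)
```

Without the lock, one thread could modify the subscriber list while another is reading it; the mutex serializes access to that shared state.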

The release also updates build artifacts across a wide range of platforms: macOS (Apple Silicon arm64 with and without KleidiAI, Intel x64, iOS XCFramework), Linux (Ubuntu x64/arm64 with CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (x64/arm64 CPU, CUDA 12/13, Vulkan, OpenCL Adreno, OpenVINO, SYCL, HIP), and Android (arm64 CPU). Builds for openEuler are currently disabled.

For the local LLM community, this update reduces friction when deploying large models like Llama, Mistral, or Falcon on consumer hardware. Real-time progress tracking helps users tell a frozen system from a long load, which improves the user experience of self-hosted AI tools.

Key Points

New /models/sse endpoint for real-time model loading progress via server-sent events.
Supports multiple platforms: macOS, Linux, Windows, Android with various backends (CPU, CUDA, Vulkan, ROCm, etc.).
Mutex added for thread safety in notify_to_router; documentation updated.

Why It Matters

The new endpoint makes loading large LLMs locally more transparent and easier to debug, which matters to developers and anyone running self-hosted AI.
