v4.7.3
TextGen now ships as a native Electron desktop app and adds tensor parallelism to the llama.cpp backend.
Oobabooga's TextGen Web UI, the popular open-source interface for running large language models locally, just dropped version 4.7.3 with several game-changing features. The standout addition is a native desktop app built on Electron: portable builds now open as a real window instead of a browser tab. Just download, unzip, and double-click; no separate start scripts are needed, though power users can still pass --listen or --nowebui to run a headless server. Alongside the desktop shift, the UI got a major overhaul: Inter replaces Noto Sans as the default font, the emoji refresh/save/delete buttons are now Lucide SVG icons, the chat mode selector became a three-button segmented control, and the chat input is a single rounded card with a circular accent-colored send button. Tab indicators use flat underlines, and sidebar toggles are now 3px hairline handles.
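For those who want the headless route, here is a minimal sketch of launching the server from Python with the two flags named above. The --listen and --nowebui flags come straight from the release; the server.py entry point is an assumption based on the project's usual layout.

```python
import subprocess

# Minimal sketch: run TextGen as a headless server instead of opening
# the Electron window. --listen and --nowebui are the flags named in
# this release; "server.py" as the entry point is an assumption.
subprocess.run(
    [
        "python", "server.py",
        "--listen",   # bind beyond localhost so other machines can connect
        "--nowebui",  # skip the UI and expose only the API server
    ],
    check=True,
)
```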
Under the hood, the most impactful change is tensor parallelism for the llama.cpp backend. The new --split-mode flag (which replaces --row-split) with its tensor option can make multi-GPU inference 60%+ faster, critical for running large models like Llama 3 or Mistral across multiple GPUs. The ik_llama.cpp backend also gains new quantization types.

Other improvements include replacing DuckDuckGo HTML scraping with the more robust ddgs library, support for standalone .jinja/.jinja2 instruction template files in the UI, and multiple bug fixes: the Stop button now works during tool call approval and between tool turns, a race condition in the ExLlamaV3 backend under concurrent API requests is resolved, and extension settings now save correctly inside user_data/extensions. Dependencies are updated to the latest versions of llama.cpp, ik_llama.cpp, and transformers. This release makes running local LLMs more performant, more user-friendly, and easier to set up for both beginners and advanced users.
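To make the tensor-parallel option concrete, here is a hedged launch sketch. Only --split-mode tensor (the replacement for --row-split) comes from the release notes; the model filename and the --n-gpu-layers offload flag are illustrative assumptions, not confirmed defaults.

```python
import subprocess

# Hedged sketch: enable tensor-parallel multi-GPU inference on the
# llama.cpp backend. --split-mode tensor replaces the old --row-split;
# the model filename and --n-gpu-layers are illustrative assumptions.
subprocess.run(
    [
        "python", "server.py",
        "--model", "Llama-3-70B-Instruct-Q4_K_M.gguf",  # assumed filename
        "--split-mode", "tensor",  # split weight tensors across visible GPUs
        "--n-gpu-layers", "999",   # assumed flag: offload all layers to GPU
    ],
    check=True,
)
```

In general, tensor parallelism splits individual weight matrices across devices so every GPU works on each layer simultaneously, rather than handing whole layers to different cards; that is the usual source of the kind of speedup reported here.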
- Native desktop app via Electron: portable builds now run as a standalone window instead of requiring a browser tab
- Tensor parallelism for llama.cpp (--split-mode tensor) can deliver 60%+ faster multi-GPU inference
- UI overhaul: Inter font, Lucide icons, segmented chat mode selector, and redesigned chat input
Why It Matters
Local LLM deployment becomes significantly faster and more accessible for developers and power users.