Oobabooga's textgen v4.9 adds MTP speculation and web upgrades
New version brings MTP speculative decoding, faster web search, and live token stats for local LLMs.
Oobabooga's textgen v4.9 introduces MTP (multi-token prediction) speculative decoding via a new --spec-type 'draft-mtp' option, which is enabled automatically when loading MTP GGUF models such as Qwen 3.6 MoE. Speculative decoding speeds up inference by drafting several tokens ahead and then verifying them together, rather than generating strictly one token at a time. Web search results now include snippets, so a query can often be answered without a follow-up fetch_webpage call, which cuts token consumption. Link URLs are now emitted as plain text, further reducing the tokens used per page. During generation, users can see live speed (tokens/s) and context size, and a spinner is shown while web search calls are running.
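To make the draft-then-verify idea concrete, here is a minimal toy sketch of one greedy speculative decoding step. It is not textgen's or llama.cpp's implementation: the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain functions over integer tokens, and a real engine would verify all drafted tokens in a single batched forward pass rather than calling the target once per token.

```python
from typing import Callable, List

# A "model" here is just a function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one greedy draft-then-verify step."""
    # 1. Cheaply draft k tokens ahead.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify: keep drafted tokens for as long as they match the token
    #    the target would have picked itself.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(ctx: List[int]) -> int:
    # Hypothetical target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except right after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0], k=4))
```

In this greedy variant the output is identical to what the target alone would produce; the speedup comes from accepting several tokens per verification step whenever the drafts are right.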
The update adds DGX Spark support (Linux aarch64 builds) and a revamped Electron app with a check-for-updates button, a models folder picker, a right-click context menu, and a spellcheck toggle. The one-click installer now tracks the latest release tag instead of main. Security is hardened: CORS is restricted to localhost by default, path traversal issues are fixed, and non-HTTP web search links are rejected. UI improvements include drag-and-drop file upload, a reorganized right sidebar, faded message animations, and a fix for streaming output leaking across chats. llama.cpp, ik_llama.cpp, and ExLlamaV3 are updated to their latest versions. Portable builds are provided for Windows and Linux with CUDA 12.4/13.1, Vulkan, ROCm 7.2, and CPU-only options.
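The three hardening measures follow well-known patterns, sketched below with the Python standard library. This illustrates the general techniques only and is not textgen's actual code: the helper names `is_http_url`, `is_local_origin`, and `resolve_inside` are hypothetical, and the project's real checks may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of hostnames treated as "localhost" for this sketch.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_http_url(url: str) -> bool:
    """Accept only http(s) links that name a host (rejects file:, javascript:, ...)."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_local_origin(origin: str) -> bool:
    """Return True if a CORS Origin header value points at the local machine."""
    return urlparse(origin).hostname in LOCAL_HOSTS


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate


if __name__ == "__main__":
    print(is_http_url("https://example.com/page"))
    print(is_http_url("file:///etc/passwd"))
    print(is_local_origin("http://localhost:7860"))
```

Resolving both paths before comparing them is the key design choice in `resolve_inside`: it normalizes `..` segments and absolute paths first, so a string-prefix trick cannot slip past the check.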
- MTP speculative decoding auto-enabled when loading MTP GGUF models (e.g., Qwen 3.6 MoE) for faster inference
- Web search now includes snippets, reducing token usage by avoiding follow-up fetch_webpage calls; URLs are plain text to save tokens
- Live generation speed (tokens/s) and context size displayed during inference; DGX Spark builds added for Linux aarch64
Why It Matters
Local LLM users get faster, more efficient inference and a more polished, secure desktop experience.