Oobabooga's textgen v4.9 adds MTP speculation and web upgrades

Text Gen WebUI May 21, 2026

⚡ New version brings MTP speculative decoding, faster web search, and live token stats for local LLMs.

Deep Dive

Oobabooga's textgen v4.9 introduces MTP (multi-token prediction) speculative decoding via a new --spec-type 'draft-mtp' option, which is enabled automatically when an MTP GGUF model such as Qwen 3.6 MoE is loaded. Rather than relying on a separate draft model, MTP uses the model's own multi-token prediction layers to propose several tokens ahead; the main model then verifies the proposals, which speeds up generation whenever they are accepted. Web search results now include snippets, so many queries can be answered directly without a follow-up fetch_webpage call, cutting token consumption. Link URLs are now returned as plain text, further reducing tokens per page. The UI also shows live generation speed (tokens/s) and context size while generating, plus a spinner during web search calls.
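To make the draft-and-verify idea concrete, here is a minimal greedy sketch in Python. It is not textgen's or llama.cpp's implementation: the names (speculative_step, generate, toy_target, toy_draft) are hypothetical, the "models" are toy functions over integers, and the drafter is a standalone function standing in for the MTP layers. The point it illustrates is that accepted drafts never change the output, they only reduce how many slow target steps are needed.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. The cheap drafter proposes k tokens ahead.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target checks each proposal. A real engine scores all positions
    #    in one batched forward pass; this loop is only for readability.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # All k drafts matched, so the target's pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens using repeated draft-and-verify rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[:len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling generate(toy_target, toy_draft, [3], 6) yields the same sequence as plain greedy decoding with toy_target alone; the drafter's mistake after token 4 is caught and corrected during verification.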

The update adds DGX Spark support (Linux aarch64 builds) and a revamped Electron app with a check-for-updates button, a models folder picker, a right-click context menu, and a spellcheck toggle. The one-click installer now tracks the latest release tag instead of main. Security is hardened: CORS is restricted to localhost by default, path traversal issues are fixed, and non-HTTP web search links are rejected. UI improvements include drag-and-drop file upload, a reorganized right sidebar, faded message animations, and a fix for streaming output leaking across chats. llama.cpp, ik_llama.cpp, and ExLlamaV3 are updated to their latest versions. Portable builds are provided for Windows and Linux with CUDA 12.4/13.1, Vulkan, ROCm 7.2, and CPU-only options.
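For readers unfamiliar with the two input-validation fixes mentioned above, the following Python sketch shows what such checks commonly look like. It is an illustration only, not the project's code: the function names (is_fetchable_url, resolve_inside) are hypothetical, and treating both http and https as acceptable schemes is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: "HTTP links" covers both plain and TLS variants.
ALLOWED_SCHEMES = {"http", "https"}


def is_fetchable_url(url: str) -> bool:
    """Accept only http(s) URLs that name a host; reject file:, javascript:, etc."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base,
        # which covers both "../" tricks and absolute paths.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

With these helpers, is_fetchable_url("file:///etc/passwd") is False, and resolve_inside("/tmp/models", "../etc/passwd") raises ValueError, while a path such as "sub/model.gguf" resolves normally under the base directory.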

Key Points

MTP speculative decoding auto-enabled when loading MTP GGUF models (e.g., Qwen 3.6 MoE) for faster inference
Web search now includes snippets, reducing token usage by avoiding follow-up fetch_webpage calls; URLs are plain text to save tokens
Live generation speed (tokens/s) and context size displayed during inference; DGX Spark builds added for Linux aarch64

Why It Matters

Local LLM users get faster, more efficient inference and a more polished, secure desktop experience.
