Developer Tools

Oobabooga's textgen v4.9 adds MTP speculation and web upgrades

New version brings MTP speculative decoding, faster web search, and live token stats for local LLMs.

Deep Dive

Oobabooga's textgen v4.9 introduces MTP (multi-token prediction) speculative decoding via a new --spec-type 'draft-mtp' option, which is enabled automatically when an MTP GGUF model such as Qwen 3.6 MoE is loaded. Speculative decoding speeds up inference by drafting several tokens ahead and letting the main model verify them in a single pass; with MTP, the drafts typically come from prediction heads shipped with the model rather than from a separate draft model. Web search results now include snippets, so many queries can be answered without a follow-up fetch_webpage call, which cuts token consumption. Link URLs are now emitted as plain text, further reducing tokens per page. Users can also see live generation speed (tokens/s) and context size while generating, plus a spinner during web search calls.
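To make the draft-and-verify idea concrete, here is a toy Python sketch of greedy speculative decoding. It is not textgen's or llama.cpp's implementation: target_next and draft_next are hypothetical stand-ins for the main model and the drafter (with MTP, the drafter would be the model's own extra prediction heads), and tokens are plain integers.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int = 4) -> List[Token]:
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1. The drafter proposes k tokens ahead.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal in order. A real engine scores all
    #    k positions in one batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Generate up to max_new tokens by repeating speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


if __name__ == "__main__":
    print(speculative_step(target_next, draft_next, [0]))  # all drafts accepted
    print(speculative_step(target_next, draft_next, [3]))  # rejected after the 4
```

Because the target has the final say at every position, the output matches plain greedy decoding with the target alone. The speedup comes from accepting several drafted tokens per verification pass whenever the drafter guesses well.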

The update adds DGX Spark support (Linux aarch64 builds) and a revamped Electron app with a check-for-updates button, a models folder picker, a right-click context menu, and a spellcheck toggle. The one-click installer now tracks the latest release tag instead of main. Security is hardened: CORS is restricted to localhost by default, path traversal issues are fixed, and non-HTTP web search links are rejected. UI improvements include drag-and-drop file upload, a reorganized right sidebar, faded message animations, and a fix for streaming leaking across chats. llama.cpp, ik_llama.cpp, and ExLlamaV3 are updated to their latest versions. Portable builds are provided for Windows and Linux with CUDA 12.4/13.1, Vulkan, ROCm 7.2, and CPU-only options.
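The two input-validation fixes follow well-known patterns, which can be sketched in Python as below. This is a generic illustration using only the standard library, not the project's actual code; the function names is_http_url and resolve_inside are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Accept only http(s) links with a host; reject file:, javascript:, etc."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and symlinks before the check.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {user_path}")
    return candidate
```

The key design point in both helpers is to normalize first and check second: the URL is parsed before its scheme is inspected, and the path is fully resolved before it is compared against the allowed base directory.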

Key Points
  • MTP speculative decoding auto-enabled when loading MTP GGUF models (e.g., Qwen 3.6 MoE) for faster inference
  • Web search now includes snippets, reducing token usage by avoiding follow-up fetch_webpage calls; URLs are plain text to save tokens
  • Live generation speed (tokens/s) and context size displayed during inference; DGX Spark builds added for Linux aarch64

Why It Matters

Local LLM users get faster, more efficient inference and a more polished, secure desktop experience.