Open Source

LM Studio 0.4.14 adds MTP speculative decoding for faster local LLMs

Speculative decoding speeds up inference by drafting several tokens ahead and verifying them together.

Deep Dive

LM Studio has added support for MTP (Multi-Token Prediction) speculative decoding in its latest beta release, version 0.4.14 Build 2. The feature is off by default. When enabled, it speeds up local LLM inference by using a draft model to predict several tokens ahead at once. To enable it, users must update their llama.cpp engine to version 2.15.0 and select "MTP" under "Manually choose model load parameters" before loading a model.

The update brings a well-known speculative decoding technique to LM Studio's user-friendly interface. MTP works by having a smaller, faster draft model propose several token candidates at once; the main model then verifies them in a single forward pass. Each accepted candidate saves one autoregressive step of the main model, which lowers latency, especially for real-time chat and coding assistants running locally. The feature is currently in beta, and the speedup depends on how often the main model accepts the draft's candidates, so performance may vary with the model pair and hardware.
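The propose-then-verify loop can be illustrated with a toy sketch in Python. This is not LM Studio or llama.cpp code: the names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, the "models" are simple functions over integer tokens, and the sketch assumes greedy (exact-match) verification. The per-position calls to the target inside the verify loop stand in for what a real engine would compute in one batched forward pass, which is why they are counted as a single pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Toy greedy speculative decoding. Returns (new_tokens, target_passes)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    passes = 0
    while len(out) < goal:
        # 1. The draft proposes up to k tokens, one after another (assumed cheap).
        n = min(k, goal - len(out))
        proposal: List[Token] = []
        for _ in range(n):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real engine would do
        #    this in one batched forward pass, so it is counted as one pass here.
        passes += 1
        accepted: List[Token] = []
        for i in range(n):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # the target's token is always kept
            if proposal[i] != expected:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every proposal matched, so the same pass also yields one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[len(prompt):], passes


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "main model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "draft model": agrees with the target except after token 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, [0], max_new=8, k=4)
print(tokens, passes)  # [1, 2, 3, 4, 5, 6, 7, 8] 2
```

In this toy run the output is identical to what the target alone would produce, but it takes 2 verification passes instead of 8 sequential steps. The one draft mistake (after token 2) costs only the discarded tail of that proposal, which is why the real-world gain depends on how often the draft agrees with the main model.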

Key Points
  • LM Studio 0.4.14 Build 2 (Beta) adds MTP speculative decoding support
  • Requires llama.cpp engine 2.15.0 and manual enabling in model load parameters
  • Speeds up inference by predicting multiple tokens in parallel, reducing latency

Why It Matters

Faster local AI inference means more responsive apps, lower costs, and better real-time experiences.