Open Source

New APEX GGUF of Qwen3.6-35B-A3B enables self-speculative decoding without draft model

r/LocalLLaMA May 31, 2026

⚡Bundled MTP head offers inference speedups with just one file and recent llama.cpp

Deep Dive

mudler has released APEX-MTP GGUF quantizations of the Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model, which combines the open-source 35B MoE architecture with reasoning distillation from Claude 4.7 Opus. The headline feature is the bundled multi-token prediction (MTP) head, which enables self-speculative decoding from a single GGUF file, with no separate draft model required. This relies on a recent llama.cpp PR (22673) that lets the MTP head act as the draft model. Running `llama-server -m model.gguf --draft-mtp` is enough to enable it, with reported speedups of typically 2-3x depending on hardware and settings. The model has 40 trunk layers plus one MTP layer, 256 routed experts (8 active per token) plus 1 shared expert, and a hidden size of 2048.
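To make the draft-and-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not llama.cpp's implementation: `target_next` stands in for the full model, `draft_next` stands in for a cheap drafter such as an MTP head, and both are hypothetical functions over integer tokens. The sketch only shows why the output matches plain greedy decoding while needing fewer verification passes of the large model.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the drafter: agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


def greedy_decode(next_fn: NextFn, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(target: NextFn, draft: NextFn, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. Draft k tokens with the cheap model.
        ctx = list(out)
        drafted = []
        for _ in range(k):
            tok = draft(ctx)
            drafted.append(tok)
            ctx.append(tok)

        # 2. Verify. A real engine scores all drafted positions in one batched
        #    pass of the large model; here we count that as a single pass.
        passes += 1
        ctx = list(out)
        accepted = []
        for tok in drafted:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all drafts accepted: one bonus token
        out.extend(accepted)
    return out[:goal], passes


tokens, passes = speculative_decode(target_next, draft_next, [0], n_new=12)
baseline = greedy_decode(target_next, [0], n_new=12)
print(tokens == baseline, passes)
```

In this toy setup the speculative path reproduces the baseline tokens exactly, and the number of verification passes is well below the 12 calls the baseline makes. Real speedups depend on how often the drafter agrees with the full model and on the cost of batched verification.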

The APEX quantization methodology is tailored to MoE models: routed experts receive the heaviest compression, while the always-active shared experts and attention layers are kept at higher precision. The MTP head is quantized to Q8_0 (near-lossless) on all tiers except I-Nano, where it uses Q4_K to keep file size manageable. File sizes increase only ~2.5% over non-MTP versions (~1 GB extra per file). A notable limitation is that the MTP head cannot yet use activation-aware quantization (imatrix): it only fires during speculative decoding, so ordinary calibration runs collect no activation statistics for it. The creator is working on a patch. APEX offers multiple quantization tiers (I-Nano, I-Balanced, etc.), calibrated on diverse data including chat, code, reasoning, and agentic traces. The quantizations are released for free as independent research; larger 200B+ models require rented H100/H200/Blackwell compute.
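The per-tensor policy described above can be pictured as a simple lookup. The sketch below is illustrative only and is not APEX's actual recipe: the tensor-name patterns (`_exps`, `_shexp`, `.attn_`) are loosely modeled on common GGUF naming, the `mtp.` prefix is a hypothetical marker for the MTP head, and every quant type other than the MTP head's Q8_0/Q4_K split (which comes from the article) is an assumed placeholder.

```python
# Hypothetical routed-expert types per tier; placeholders, not APEX's real choices.
ROUTED_EXPERT_TYPE = {
    "I-Nano": "IQ2_XXS",
    "I-Balanced": "Q4_K",
}


def pick_quant(tensor_name: str, tier: str) -> str:
    """Return an illustrative quant type for one tensor under a given tier."""
    if tensor_name.startswith("mtp."):
        # From the article: Q8_0 on all tiers except I-Nano, which uses Q4_K.
        return "Q4_K" if tier == "I-Nano" else "Q8_0"
    if "_exps" in tensor_name:
        # Routed experts: most of the parameters, so compressed hardest.
        return ROUTED_EXPERT_TYPE[tier]
    if "_shexp" in tensor_name or ".attn_" in tensor_name:
        # Always-active shared experts and attention: kept at higher precision.
        return "Q8_0"  # assumed value
    return "Q6_K"  # assumed default for everything else


for name in ["blk.0.ffn_up_exps.weight", "blk.0.ffn_up_shexp.weight",
             "blk.0.attn_q.weight", "mtp.head.weight"]:
    print(name, pick_quant(name, "I-Nano"), pick_quant(name, "I-Balanced"))
```

The design point is that only 8 of 256 routed experts run per token, so aggressive compression there affects a small share of each forward pass, while the shared expert and attention weights run on every token.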

Key Points

Bundles the MTP head directly in the GGUF file, enabling self-speculative decoding via `--draft-mtp` with no separate draft model needed
Uses APEX mixed-precision quantization: routed experts compressed hardest, shared experts/attention kept high precision for quality retention
Available in multiple tiers (I-Nano, I-Balanced, etc.), with the MTP head at near-lossless Q8_0 on all tiers except I-Nano (Q4_K); file sizes only ~2.5% larger than non-MTP versions

Why It Matters

Enables faster local inference on large MoE models without additional model downloads, lowering the hardware needed to run advanced reasoning models at usable speeds.
