Open Source

New APEX GGUF of Qwen3.6-35B-A3B enables self-speculative decoding without draft model

Bundled MTP head offers inference speedups with just one file and a recent llama.cpp build

Deep Dive

mudler has released APEX-MTP GGUF quantizations of the Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model, which combines the open-source 35B MoE architecture with reasoning distillation from Claude 4.7 Opus. The headline feature is the bundled multi-token prediction (MTP) head, which enables self-speculative decoding from a single GGUF file, with no separate draft model required. This relies on a recent llama.cpp PR (22673) that lets the MTP head act as the draft model: the head proposes several tokens ahead, and the main model verifies them in one pass. Running `llama-server -m model.gguf --draft-mtp` turns it on, with reported inference speedups of typically 2-3x depending on hardware and settings. The model has 40 trunk layers plus one MTP layer, 256 routed experts (8 active per token) plus 1 shared expert, and a hidden size of 2048.
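The draft-and-verify idea behind self-speculative decoding can be sketched in a few lines. This is a toy illustration only, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the trunk and the MTP head, decoding is greedy, and the single batched verification pass is simulated with a loop and counted as one pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int,
                       k: int = 3) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop. Returns (tokens, verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The cheap draft proposes k tokens ahead.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks the proposal. A real engine scores all
        #    positions in one batched forward pass; here we loop and
        #    count the whole check as a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the pass also yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "full model": a deterministic next-token rule.
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "MTP head": agrees with the target except on some contexts.
    guess = toy_target(ctx)
    return guess if ctx[-1] % 5 else (guess + 1) % 17


def greedy(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    # Baseline: one target pass per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, [1], n_new=12)
    print(tokens == greedy(toy_target, [1], 12), passes)
```

In this greedy setting the output is identical to plain decoding with the target alone; the gain comes from needing fewer target passes when the draft is often right, which is why real-world speedups depend on the acceptance rate as well as the hardware.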

The APEX quantization methodology is tailored for MoE models: routed experts receive the heaviest compression, while the always-active shared expert and the attention layers are kept at higher precision. The MTP head is quantized to Q8_0 (near-lossless) on all tiers except I-Nano, where it uses Q4_K to keep file size manageable. File sizes grow only ~2.5% over the non-MTP versions (~1 GB extra per file). A notable limitation is that the MTP head cannot yet use activation-aware quantization (imatrix), because it only fires during speculative decoding and so is not exercised by the usual calibration pass; the creator is working on a patch. APEX offers multiple quantization tiers (I-Nano, I-Balanced, etc.), calibrated on diverse data including chat, code, reasoning, and agentic traces. The quantizations are published for free as independent research; larger 200B+ models require rented H100/H200/Blackwell compute.
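A mixed-precision plan of this kind amounts to a mapping from tensor role to quantization type. The sketch below is an assumption-heavy illustration, not APEX's published recipe: only the MTP rule (Q8_0, or Q4_K on I-Nano) comes from the description above. The other type assignments are placeholders chosen to show the ordering "routed experts lowest, shared expert and attention higher", and the `mtp.` tensor prefix is hypothetical.

```python
import re


def pick_quant(tensor_name: str, tier: str = "I-Balanced") -> str:
    """Choose a quantization type for one tensor (illustrative plan only)."""
    # MTP head: Q8_0 on every tier except I-Nano (rule taken from the article).
    # The "mtp." prefix is a hypothetical naming convention.
    if tensor_name.startswith("mtp."):
        return "Q4_K" if tier == "I-Nano" else "Q8_0"

    # Routed experts hold most of the weights and are active only
    # 8-of-256 per token, so they take the heaviest compression.
    # (The specific types here are assumptions.)
    if re.search(r"ffn_(gate|up|down)_exps", tensor_name):
        return "IQ2_S" if tier == "I-Nano" else "IQ4_XS"

    # The shared expert and attention run on every token: keep them
    # at higher precision. (Assumed type.)
    if "shexp" in tensor_name or ".attn_" in tensor_name:
        return "Q6_K"

    # Everything else gets a middle-of-the-road default. (Assumed type.)
    return "Q5_K"


if __name__ == "__main__":
    for name in ("blk.3.ffn_up_exps.weight",
                 "blk.3.ffn_gate_shexp.weight",
                 "blk.3.attn_q.weight",
                 "mtp.ffn_down_exps.weight"):
        print(name, pick_quant(name), pick_quant(name, "I-Nano"))
```

Checking the MTP rule first matters: the MTP layer contains its own expert and attention tensors, and they should follow the head's rule instead of the trunk's.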

Key Points
  • Bundles the MTP head directly in the GGUF file, enabling self-speculative decoding via `--draft-mtp` with no separate draft model needed
  • Uses APEX mixed-precision quantization: routed experts are compressed hardest, while the shared expert and attention layers stay at higher precision to preserve quality
  • Available in multiple tiers (I-Nano, I-Balanced, etc.), with the MTP head at near-lossless Q8_0 on every tier except I-Nano (Q4_K); file sizes are only ~2.5% larger than non-MTP versions

Why It Matters

Enables faster local inference on large MoE models with no additional model download, making advanced reasoning models more practical to deploy on existing hardware.