Open Source

DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

A new MLX-native implementation of DFlash speculative decoding achieves 3.3x faster inference on Apple Silicon.

Deep Dive

A developer has created the first native MLX implementation of DFlash speculative decoding for Apple Silicon, achieving dramatic speedups on popular open-source models. Running on an M5 Max with 64GB of unified memory, the system decodes Qwen3.5-9B at 85 tokens per second, a 3.3x improvement over the 26 tok/s baseline. The technique has a small draft model generate a block of 16 tokens in parallel via block diffusion; the larger target model then verifies the whole block in a single forward pass, and the output remains bit-for-bit identical to standard greedy decoding.
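
The draft-and-verify step is easy to sketch. The snippet below is a minimal illustration of the general scheme in MLX-flavored Python, not the DFlash implementation itself; `target_model` is a hypothetical callable returning per-position logits, and `BLOCK` matches the 16-token draft size from the post. The step accepts the longest draft prefix that agrees with the target's greedy picks, which is what preserves bit-for-bit equivalence with greedy decoding.

```python
import mlx.core as mx

BLOCK = 16  # draft block size reported in the post

def verify_step(target_model, context: mx.array, draft_tokens: mx.array) -> mx.array:
    # One target forward pass over the context plus the drafted block.
    seq = mx.concatenate([context, draft_tokens])
    logits = target_model(seq)  # assumed shape: (seq_len, vocab)

    # Logits at position i predict token i + 1, so the target's greedy
    # picks for the draft positions start one index before the block.
    start = context.shape[0] - 1
    preds = mx.argmax(logits[start : start + BLOCK], axis=-1)

    # Accept the longest prefix on which draft and target agree; this is
    # what keeps output bit-for-bit identical to plain greedy decoding.
    agree = (preds == draft_tokens).astype(mx.int32)
    n_accept = mx.cumprod(agree).sum().item()  # forces one GPU-CPU sync

    if n_accept < BLOCK:
        # The target's correction at the first mismatch comes for free.
        bonus = preds[n_accept : n_accept + 1]
    else:
        # Full acceptance: the final position's logits yield one extra token.
        bonus = mx.argmax(logits[-1:], axis=-1)
    return mx.concatenate([draft_tokens[:n_accept], bonus])
```

Even this toy version shows where syncs creep in: the `.item()` call blocks until the GPU finishes, which is exactly the kind of round trip the sync elision discussed in the next paragraph targets.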

Key optimizations included a critical patch to support `head_dim=256` in MLX's attention mechanism, sync elision to cut GPU-CPU round trips, and packed QKV projections (sketched below). The developer found that on Apple's unified-memory architecture inference is bandwidth-bound rather than compute-bound, so hand-written Metal kernels actually ran slower than stock MLX operations. Interestingly, on quantized models such as Qwen3.5-27B the optimization landscape flips: the bf16 draft model becomes the bottleneck rather than the int4/int8 verification. The work highlights how differently speculative decoding behaves on bandwidth-constrained hardware compared with traditional GPU setups.
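
Packed QKV is the most portable of these ideas. The sketch below shows the general pattern in MLX under assumptions of my own (class and parameter names are illustrative, not the author's code): fusing the three attention projections into one `nn.Linear` means the weight matrices are read once per token instead of three times, which is exactly what a bandwidth-bound machine rewards.

```python
import mlx.core as mx
import mlx.nn as nn

class PackedQKV(nn.Module):
    """Illustrative packed-QKV projection: one fused matmul instead of three."""

    def __init__(self, dims: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dims // num_heads  # the post separately patches MLX attention for head_dim=256
        self.qkv = nn.Linear(dims, 3 * dims, bias=False)  # fused Q, K, V weights

    def __call__(self, x: mx.array):
        B, L, _ = x.shape
        q, k, v = mx.split(self.qkv(x), 3, axis=-1)  # slice back into Q, K, V
        shape = (B, L, self.num_heads, self.head_dim)
        # (batch, heads, seq, head_dim) layout for the attention kernel
        return (q.reshape(shape).transpose(0, 2, 1, 3),
                k.reshape(shape).transpose(0, 2, 1, 3),
                v.reshape(shape).transpose(0, 2, 1, 3))
```

Sync elision, by contrast, leans on MLX's lazy evaluation: keeping token IDs on-device as `mx.array`s and evaluating once per accepted block with `mx.eval`, rather than pulling every token to the CPU via `.item()`, removes most per-token GPU-CPU round trips.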

Key Points
  • Achieves 85 tok/s (3.3x speedup) on Qwen3.5-9B using M5 Max with 64GB RAM
  • Uses DFlash speculative decoding with parallel draft generation and single-pass verification
  • Reveals that Apple Silicon's bandwidth-bound nature makes custom Metal kernels slower than stock MLX primitives

Why It Matters

Enables significantly faster local AI inference on Macs, making larger models practical for developers and researchers.