Open Source

DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

A new MLX-native implementation of DFlash speculative decoding achieves 3.3x faster inference on Apple Silicon.

Deep Dive

A developer has created the first native MLX implementation of DFlash speculative decoding for Apple Silicon, achieving dramatic speedups on popular open-source models. Running on an M5 Max with 64GB of unified memory, the system decodes Qwen3.5-9B at 85 tokens per second, a 3.3x improvement over the 26 tok/s baseline. The technique has a small draft model generate a block of 16 tokens in parallel via block diffusion; the larger target model then verifies the whole block in a single forward pass, and the output remains bit-for-bit identical to standard greedy decoding.
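
The draft-and-verify step is easy to sketch. The snippet below is a minimal illustration of the general scheme in MLX-flavored Python, not the DFlash implementation itself; `target_model` is a hypothetical callable returning per-position logits, and `BLOCK` matches the 16-token draft size from the post. The step accepts the longest draft prefix that agrees with the target's greedy picks, which is what preserves bit-for-bit equivalence with greedy decoding.

```python
import mlx.core as mx

BLOCK = 16  # draft block size reported in the post

def verify_step(target_model, context: mx.array, draft_tokens: mx.array) -> mx.array:
    # One target forward pass over the context plus the drafted block.
    seq = mx.concatenate([context, draft_tokens])
    logits = target_model(seq)  # assumed shape: (seq_len, vocab)

    # Logits at position i predict token i + 1, so the target's greedy
    # picks for the draft positions start one index before the block.
    start = context.shape[0] - 1
    preds = mx.argmax(logits[start : start + BLOCK], axis=-1)

    # Accept the longest prefix on which draft and target agree; this is
    # what keeps output bit-for-bit identical to plain greedy decoding.
    agree = (preds == draft_tokens).astype(mx.int32)
    n_accept = mx.cumprod(agree).sum().item()  # forces one GPU-CPU sync

    if n_accept < BLOCK:
        # The target's correction at the first mismatch comes for free.
        bonus = preds[n_accept : n_accept + 1]
    else:
        # Full acceptance: the final position's logits yield one extra token.
        bonus = mx.argmax(logits[-1:], axis=-1)
    return mx.concatenate([draft_tokens[:n_accept], bonus])
```

Even this toy version shows where syncs creep in: the `.item()` call blocks until the GPU finishes, which is exactly the kind of round trip the sync elision discussed in the next paragraph targets.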

Key optimizations included a critical patch to support `head_dim=256` in MLX's attention mechanism, sync elision to cut GPU-CPU round trips, and packed QKV projections (sketched below). The developer found that on Apple's unified-memory architecture inference is bandwidth-bound rather than compute-bound, so hand-written Metal kernels actually ran slower than stock MLX operations. Interestingly, on quantized models such as Qwen3.5-27B the optimization landscape flips: the bf16 draft model becomes the bottleneck rather than the int4/int8 verification. The work highlights how differently speculative decoding behaves on bandwidth-constrained hardware compared with traditional GPU setups.
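
Packed QKV is the most portable of these ideas. The sketch below shows the general pattern in MLX under assumptions of my own (class and parameter names are illustrative, not the author's code): fusing the three attention projections into one `nn.Linear` means the weight matrices are read once per token instead of three times, which is exactly what a bandwidth-bound machine rewards.

```python
import mlx.core as mx
import mlx.nn as nn

class PackedQKV(nn.Module):
    """Illustrative packed-QKV projection: one fused matmul instead of three."""

    def __init__(self, dims: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dims // num_heads  # the post separately patches MLX attention for head_dim=256
        self.qkv = nn.Linear(dims, 3 * dims, bias=False)  # fused Q, K, V weights

    def __call__(self, x: mx.array):
        B, L, _ = x.shape
        q, k, v = mx.split(self.qkv(x), 3, axis=-1)  # slice back into Q, K, V
        shape = (B, L, self.num_heads, self.head_dim)
        # (batch, heads, seq, head_dim) layout for the attention kernel
        return (q.reshape(shape).transpose(0, 2, 1, 3),
                k.reshape(shape).transpose(0, 2, 1, 3),
                v.reshape(shape).transpose(0, 2, 1, 3))
```

Sync elision, by contrast, leans on MLX's lazy evaluation: keeping token IDs on-device as `mx.array`s and evaluating once per accepted block with `mx.eval`, rather than pulling every token to the CPU via `.item()`, removes most per-token GPU-CPU round trips.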

Key Points
  • Achieves 85 tok/s (3.3x speedup) on Qwen3.5-9B using M5 Max with 64GB RAM
  • Uses DFlash speculative decoding with parallel draft generation and single-pass verification
  • Reveals that Apple Silicon's bandwidth-bound nature makes custom Metal kernels slower than stock MLX primitives

Why It Matters

Enables significantly faster local AI inference on Macs, making larger models practical for developers and researchers.