Open Source

DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max

New DFlash support in oMLX 0.3.5 RC1 more than doubles generation speed, from 9 to 22 tokens/second.

Deep Dive

A significant performance breakthrough has emerged for running large language models locally on Apple hardware. The new DFlash support, integrated into the oMLX 0.3.5 RC1 framework, has more than doubled the generation speed of the Qwen3.5 27B model at BF16 precision. Initial benchmarks on a high-end Mac M5 Max with 128GB of RAM show a leap from 9 to 22 tokens per second (T/S), roughly a 2.4× improvement. This directly addresses the main bottleneck for Qwen3.5 27B, a model widely regarded as exceptionally capable for its parameter count but previously limited by slow inference.

The test setup used `Jackrong/MLX-Qwopus3.5-27B-v3-bf16` as the main model and `z-lab/Qwen3.5-27B-DFlash` as the draft model, both sourced from Hugging Face. The DFlash project, hosted on GitHub, implements a speculative decoding technique: a smaller, faster "draft" model proposes candidate token sequences, which the larger, more accurate main model then verifies in parallel. Because the main model checks several draft tokens in a single forward pass, it pays the cost of one pass for multiple output tokens whenever drafts are accepted, which is where the latency savings come from. This method, now optimized for Apple's MLX framework through oMLX, means the strong reasoning and coding capabilities of the 27-billion-parameter Qwen3.5 model can be harnessed locally with far more practical responsiveness, enabling more complex agentic workflows and real-time applications directly on a Mac.
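The draft-propose/verify loop at the heart of speculative decoding can be sketched with toy stand-in "models". This is purely illustrative: the deterministic functions below are hypothetical, and the real DFlash/oMLX pipeline verifies draft tokens with a single batched forward pass of the 27B model rather than a Python loop.

```python
# Toy sketch of speculative decoding. The "models" are cheap deterministic
# functions standing in for the draft and main LLMs (illustrative only).

def draft_model(context):
    # Cheap draft: predict the next token as (last + 1) mod 10.
    return (context[-1] + 1) % 10

def target_model(context):
    # "Accurate" main model: same rule, except after token 5 it emits 0,
    # so the draft will occasionally be wrong and get corrected.
    if context[-1] == 5:
        return 0
    return (context[-1] + 1) % 10

def speculative_step(context, k=4):
    # 1) Draft model proposes k tokens autoregressively (fast, cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) Main model verifies the proposals; in a real implementation all k
    #    positions are scored in one parallel forward pass.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)        # draft token verified, keep it
            ctx.append(tok)
        else:
            accepted.append(expected)   # replace with the main model's
            break                       # token and stop this round
    return accepted                     # always yields >= 1 token

# Generate a short sequence: each step costs one "main model pass"
# but can emit up to k tokens when the draft guesses well.
out = [1]
while len(out) < 9:
    out.extend(speculative_step(out))
print(out[:9])
```

Each call to `speculative_step` returns at least one token (a verified draft token or the main model's correction), so output never stalls; the speedup comes from the rounds where most of the k draft tokens are accepted.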

Key Points
  • DFlash integration in oMLX 0.3.5 RC1 lifted Qwen3.5 27B (BF16) generation speed from 9 to 22 T/S on an M5 Max.
  • The technique uses speculative decoding with a draft model to accelerate the larger main model's token generation.
  • This makes the highly capable Qwen3.5 27B model far more practical for local deployment and real-time use cases.

Why It Matters

It makes powerful, open-source LLMs like Qwen3.5 viable for fast, local deployment on consumer Apple hardware, reducing cloud dependency.