Open Source

Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results

A 4.67x speed boost over baseline was achieved through 36 AI-assisted experiments and novel SSD streaming techniques.

Deep Dive

Developer Anemll has achieved a significant breakthrough in running massive language models on consumer hardware, pushing the 397-billion-parameter Qwen3.5 model to 20.34 tokens per second on a MacBook Pro M5 Max with 128GB of RAM. That is a 4.67x speedup over the original baseline of 4.36 tok/s on an M3 Max. The 209GB model, far too large to fit in the available RAM, streams entirely from the SSD through a pure C/Metal inference engine, showcasing what careful memory management can achieve.

The optimization process was accelerated using an 'autoresearch' methodology powered by Claude Code (Anthropic). Anemll directed the research while the AI assistant implemented and benchmarked 36 systematic experiments over a few days—a task that would have taken weeks manually. Each experiment was automatically gated by a perplexity threshold to prevent quality regressions, ensuring that speed gains did not come at the cost of model accuracy. This human-AI collaboration proved highly effective for iterative hardware optimization.
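The exact harness is not published in this summary, but the gating logic amounts to a simple accept/reject rule: a faster configuration is only kept if its perplexity stays within tolerance of the baseline. Below is a minimal C sketch of such a gate; the struct layout, function name, and tolerance value are assumptions, and the 5.62 reference perplexity simply reuses the 4-bit figure quoted later in this piece.

```c
/* Minimal sketch of a perplexity-gated experiment check (assumed structure,
 * not the actual autoresearch harness). An experiment is only accepted if it
 * beats the current best speed without regressing past the quality gate. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double tok_per_s;     /* measured decode speed */
    double perplexity;    /* measured on a fixed evaluation set */
} experiment_result_t;

#define BASELINE_PPL   5.62   /* reference perplexity (4-bit baseline figure) */
#define PPL_TOLERANCE  0.05   /* assumed: maximum allowed regression */

bool accept_experiment(const experiment_result_t *r, double best_tok_per_s) {
    if (r->perplexity > BASELINE_PPL + PPL_TOLERANCE) {
        printf("REJECT %s: perplexity %.2f fails the quality gate\n",
               r->name, r->perplexity);
        return false;
    }
    if (r->tok_per_s <= best_tok_per_s) {
        printf("SKIP   %s: %.2f tok/s does not beat %.2f tok/s\n",
               r->name, r->tok_per_s, best_tok_per_s);
        return false;
    }
    printf("ACCEPT %s: %.2f tok/s at perplexity %.2f\n",
           r->name, r->tok_per_s, r->perplexity);
    return true;
}
```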

Key technical optimizations that moved the needle included splitting SSD I/O across 16 threads issuing parallel, page-aligned reads to different SSD channels, which added +1.5 tok/s. Temporal expert prediction, which exploited a discovered 27% cross-token correlation in expert routing to overlap SSD reads with GPU compute, provided the largest single boost at +4.3 tok/s. Finally, a 3-bit quantization scheme (Unsloth IQ3_XXS/IQ4_XS) not only shrank the model payload by 23% but also, surprisingly, improved perplexity (5.58 vs 5.62 for 4-bit), contributing +2.3 tok/s.
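To make the first of these concrete, here is a minimal C sketch of multi-threaded, page-aligned SSD reads in the spirit of that optimization. This is not Anemll's code: the thread count, the 16KB page size, the requirement that the caller pass a page-aligned buffer and offset, and the macOS-specific F_NOCACHE hint are all assumptions for illustration.

```c
/* Sketch: split one large, page-aligned read of expert weights across
 * 16 threads so the SSD sees many outstanding requests at once. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_THREADS 16
#define PAGE_SIZE   16384   /* 16 KB pages on Apple Silicon (assumed) */

typedef struct {
    int    fd;
    off_t  offset;   /* page-aligned start of this thread's slice */
    size_t length;   /* bytes to read (may be 0 for trailing threads) */
    void  *dst;      /* destination within the caller's aligned buffer */
} read_job_t;

static void *reader_thread(void *arg) {
    read_job_t *job = (read_job_t *)arg;
    size_t done = 0;
    while (done < job->length) {
        ssize_t n = pread(job->fd, (char *)job->dst + done,
                          job->length - done, job->offset + done);
        if (n <= 0) { perror("pread"); break; }
        done += (size_t)n;
    }
    return NULL;
}

/* Read `total` bytes starting at page-aligned `base` into `dst`
 * (allocated page-aligned by the caller, e.g. via posix_memalign). */
int parallel_read(const char *path, off_t base, size_t total, void *dst) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
#ifdef F_NOCACHE
    fcntl(fd, F_NOCACHE, 1);   /* bypass the unified buffer cache (macOS) */
#endif
    pthread_t  tid[NUM_THREADS];
    read_job_t job[NUM_THREADS];
    /* Per-thread slice, rounded up to a page multiple. */
    size_t slice = ((total / NUM_THREADS) + PAGE_SIZE - 1) & ~(size_t)(PAGE_SIZE - 1);

    for (int i = 0; i < NUM_THREADS; i++) {
        size_t start  = (size_t)i * slice;
        job[i].fd     = fd;
        job[i].offset = base + (off_t)start;
        job[i].length = (start >= total) ? 0
                      : (start + slice > total ? total - start : slice);
        job[i].dst    = (char *)dst + start;
        pthread_create(&tid[i], NULL, reader_thread, &job[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(tid[i], NULL);
    close(fd);
    return 0;
}
```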

The work, built upon Dan Woods' original flash-moe project, highlights a classic shifting bottleneck problem in performance engineering. As each bottleneck—from SSD I/O to GPU encoding overhead—was solved, a new one emerged. The final gains came from low-level Metal GPU optimizations like fusing the Q/K/V projection kernel and using command buffer pre-encoding to eliminate microsecond-level submission gaps. This project is a compelling case study in maximizing the performance of state-of-the-art AI models on cutting-edge, yet accessible, Apple Silicon.
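The actual fused kernel lives in Metal, but the idea behind Q/K/V fusion can be sketched in plain C: stack the three projection matrices so a single pass (a single kernel dispatch on the GPU) produces q, k, and v together, instead of launching three separate projections that each re-read the same input. The dimensions and names below are illustrative, not taken from the project.

```c
/* Conceptual sketch of Q/K/V projection fusion in plain C.
 * w_qkv is row-major [3*d_out x d_in]: rows 0..d_out-1 hold the Q weights,
 * the next d_out rows hold K, and the last d_out rows hold V. One loop over
 * the stacked matrix stands in for one fused GPU dispatch. */
#include <stddef.h>

void qkv_fused(const float *w_qkv, const float *x,
               float *q, float *k, float *v,
               size_t d_in, size_t d_out) {
    float *out[3] = { q, k, v };
    for (size_t r = 0; r < 3 * d_out; r++) {
        const float *row = w_qkv + r * d_in;
        float acc = 0.0f;
        for (size_t c = 0; c < d_in; c++)
            acc += row[c] * x[c];
        out[r / d_out][r % d_out] = acc;   /* route result to q, k, or v */
    }
}
```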

Key Points
  • Achieved 20.34 tok/s decode speed for Qwen3.5-397B on M5 Max, a 4.67x gain over the original M3 Max baseline of 4.36 tok/s.
  • Used Claude Code to run 36 AI-assisted experiments with automatic quality gating, compressing weeks of work into days.
  • Critical optimizations included 16-thread parallel SSD I/O (+1.5 tok/s), 3-bit quantization with better perplexity than 4-bit (+2.3 tok/s), and temporal expert prediction (+4.3 tok/s).

Why It Matters

Demonstrates how AI-assisted research and clever engineering can run massive 400B-parameter models at usable speeds on high-end laptops, democratizing access to frontier AI.