Open Source

Running Qwen3.5 397B on M3 MacBook Pro with 48GB RAM at 5.7 t/s

A developer harnesses Apple's 'LLM in a Flash' to run a 397B-parameter model on consumer hardware, achieving 5.7 t/s.

Deep Dive

Developer Dan Woods has achieved a significant breakthrough in local AI inference by running the colossal Qwen3.5 397B-parameter model on a consumer M3 MacBook Pro equipped with just 48GB of unified memory. By building a custom harness that combines techniques from Andrej Karpathy's auto-research project with the principles outlined in Apple's recent "LLM in a Flash" research paper, Woods's system generates text at 5.7 tokens per second. That is a remarkable feat for a model of this size, which would traditionally require hundreds of gigabytes of GPU VRAM to serve.

Woods's work demonstrates the power of optimized, memory-efficient inference. The "LLM in a Flash" approach cleverly stores model weights in flash memory (SSD) and dynamically loads only the necessary slices into RAM during computation, drastically reducing memory pressure. Woods notes that his current 5.7 t/s speed is just the beginning; his calculations suggest the same hardware could theoretically reach 18 t/s with further optimization. He also posits that dense models with more predictable weight access patterns could see even greater performance gains, pointing toward a future where running frontier-scale models on personal devices becomes commonplace.
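Woods's actual harness isn't shown in the article, but the loading pattern described in "LLM in a Flash" can be sketched in a few lines of Python: keep the weight matrices memory-mapped on the SSD and copy into RAM only the rows a given token is predicted to need. Everything below (the toy dimensions, the file name, and the dummy predict_active_rows heuristic) is hypothetical and only meant to illustrate the idea, not to reproduce Woods's or Apple's code.

```python
import numpy as np

# Toy dimensions for illustration; Qwen3.5 397B's real shapes are far larger.
HIDDEN, FFN = 1024, 4096

def open_layer_weights(path: str) -> np.memmap:
    """Map a layer's FFN weight matrix straight from flash (SSD).
    np.memmap reads nothing into RAM until specific rows are indexed."""
    return np.memmap(path, dtype=np.float16, mode="r", shape=(FFN, HIDDEN))

def predict_active_rows(hidden_state: np.ndarray, top_k: int = 512) -> np.ndarray:
    """Stand-in for the paper's small sparsity predictor, which guesses which
    FFN rows will fire for this token. A real predictor is trained; this dummy
    just picks a deterministic pseudo-random subset."""
    seed = abs(hash(hidden_state.tobytes())) % (2**32)
    rng = np.random.default_rng(seed)
    rows = rng.choice(FFN, size=top_k, replace=False)
    return np.sort(rows)  # sorted indices keep the SSD reads roughly sequential

def ffn_step(weights: np.memmap, hidden_state: np.ndarray) -> np.ndarray:
    """Copy only the predicted-active rows from flash into RAM, then matmul."""
    rows = predict_active_rows(hidden_state)
    active = np.asarray(weights[rows])   # this indexing is what hits the SSD
    return active @ hidden_state         # partial FFN pre-activation for one token

if __name__ == "__main__":
    # Write a small dummy weight file so the sketch runs end to end.
    path = "layer_00_ffn_up.bin"
    np.random.default_rng(0).standard_normal((FFN, HIDDEN)).astype(np.float16).tofile(path)
    x = np.random.default_rng(1).standard_normal(HIDDEN).astype(np.float16)
    w = open_layer_weights(path)
    print(ffn_step(w, x).shape)          # (512,) — only 512 of 4096 rows were read
```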

Key Points
  • Runs 397B-parameter Qwen3.5 model on M3 MacBook Pro with only 48GB RAM, achieving 5.7 tokens/sec.
  • Leverages Apple's 'LLM in a Flash' paper to store weights on SSD and load slices dynamically into RAM.
  • Theoretical speed of 18 t/s is possible on the same hardware (see the back-of-envelope sketch below), democratizing access to massive AI models.
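
The article doesn't break down how the 5.7 t/s and 18 t/s figures were derived, but for flash-bound decoding the shape of the estimate is simple: throughput is roughly the effective SSD read bandwidth divided by the weight bytes fetched per token. The bandwidth and per-token read figures below are hypothetical placeholders chosen to reproduce the reported numbers, not Woods's measurements.

```python
def tokens_per_second(ssd_read_gbps: float, gb_read_per_token: float) -> float:
    """For flash-bound decoding, throughput ~= read bandwidth / bytes fetched per token."""
    return ssd_read_gbps / gb_read_per_token

# Hypothetical placeholder numbers (not from Woods's post): an M3 MacBook Pro SSD
# sustaining ~6 GB/s of reads, and a harness currently pulling ~1.05 GB of weights
# from flash per generated token.
print(round(tokens_per_second(6.0, 1.05), 1))   # -> 5.7
# If smarter slice prediction and row bundling cut the reads to ~0.33 GB per token,
# the same SSD would land near the projected ceiling.
print(round(tokens_per_second(6.0, 0.33), 1))   # -> 18.2
```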

Why It Matters

This brings frontier-scale AI model capabilities to powerful consumer hardware, reducing dependency on cloud APIs and expensive server-grade GPUs for inference.