Open Source

M5 Max MacBook Pro 128GB RAM - Qwen3-Coder-Next 8-Bit Benchmark

Apple's MLX framework runs the Qwen3-Coder-Next 8-bit model over 2x faster than Ollama on the same Mac.

Deep Dive

A new benchmark reveals the stark performance advantage of Apple's native MLX framework over the popular Ollama backend for running local coding models. Testing Alibaba's Qwen3-Coder-Next model in 8-bit quantization on an M5 Max MacBook Pro with 128GB RAM, MLX delivered an average throughput of 72 tokens per second across six real-world programming tasks. This performance more than doubled Ollama's average of 35 tokens/sec, with speed advantages ranging from 92% to 118% depending on the task. Furthermore, MLX drastically reduced latency, cutting the Time to First Token (TTFT) by 47% to 58%, making local AI coding assistance feel much more responsive.
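
These two metrics are simple to define: TTFT is the wall-clock time from submitting the prompt to receiving the first generated token, and decode throughput counts tokens produced after that point. The harness below is an illustrative sketch rather than the benchmark's actual code; `stream` stands in for any token-streaming client (MLX- or Ollama-backed alike):

```python
import time

def measure_stream(stream):
    """Time a token stream: returns (TTFT in ms, decode tokens/sec).

    `stream` is assumed to be any iterable that yields one generated
    token (or text chunk) at a time, e.g. a streaming MLX or Ollama
    client. Illustrative only; not the benchmark's actual harness.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0

    for _chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrives: TTFT
        n_tokens += 1

    if n_tokens < 2:
        raise ValueError("need at least two tokens to compute decode throughput")

    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000
    # Decode throughput counts tokens produced after the first one,
    # so prompt-processing time does not inflate the figure.
    decode_tps = (n_tokens - 1) / (end - first_token_at)
    return ttft_ms, decode_tps
```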

The test suite covered a spectrum of coding challenges, from writing simple functions to complex code reviews and debugging tasks. MLX's consistent lead—achieving up to 78 tokens/sec on medium and long-form generation—demonstrates its superior optimization for Apple Silicon. This performance, achieved using the community-hosted 'mlx-community/Qwen3-Coder-Next-8bit' model, provides a compelling case for developers to switch to MLX for local inference. The results highlight that for professionals using high-end Apple hardware, choosing the right local inference backend is critical for maximizing the speed and responsiveness of AI-powered coding tools.
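
Trying the MLX side of such a comparison takes only a few lines. A minimal sketch, assuming the mlx-lm package (`pip install mlx-lm`) and the community model named above; the prompt is made up, and the exact `generate` signature can vary slightly between mlx-lm versions:

```python
from mlx_lm import load, generate

# First call downloads the 8-bit community conversion from the
# Hugging Face Hub, then loads the weights into unified memory.
model, tokenizer = load("mlx-community/Qwen3-Coder-Next-8bit")

# Chat-tuned models expect their chat template around the prompt.
messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# verbose=True also prints tokens-per-second stats like those above.
print(generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True))
```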

Key Points
  • MLX averaged 72 tokens/sec, more than double Ollama's 35 tokens/sec, a 107% performance increase.
  • Time to First Token (TTFT) was cut by 47% to 58% with MLX, with latencies as low as 76ms.
  • The benchmark used Alibaba's Qwen3-Coder-Next 8-bit model on an M5 Max Mac with 128GB RAM across six real coding tasks.

Why It Matters

Developers can get near-instant, high-quality AI coding assistance locally by leveraging Apple's optimized MLX framework on high-end Macs.
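
For anyone wanting to cross-check numbers like these on their own machine, Ollama reports token counters in every non-streaming response. A minimal sketch, assuming a local Ollama server on the default port; the model tag shown is hypothetical and depends on how the model was pulled:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3-coder-next:8bit",  # hypothetical tag; use your pulled model
        "prompt": "Write a binary search in Python.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Ollama reports eval_count generated tokens over eval_duration nanoseconds.
tps = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"decode throughput: {tps:.1f} tokens/sec")
```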