Ollama is now powered by MLX on Apple Silicon in preview
With the new MLX backend, Apple M5 chips run local models over 2x faster at decode and reach 1851 token/s prefill using NVIDIA's NVFP4 format.
Ollama has launched a preview of version 0.19 that fundamentally changes how local AI models run on Apple Silicon Macs by switching to Apple's native MLX machine learning framework. The new backend leverages the unified memory of Apple's chips, yielding dramatic performance gains: benchmarks show prefill speeds jumping from 1154 tokens/second to 1851 tokens/second (a 60% increase) and decode performance more than doubling from 58 to 134 tokens/second on M5-series processors. The update specifically accelerates demanding workloads like coding agents (Claude Code, OpenCode) and personal assistants (OpenClaw) while maintaining model accuracy through NVIDIA's NVFP4 quantization format.
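For developers who want to try the preview, a minimal sketch using the official `ollama` Python client might look like the following. The exact model tag for the NVFP4 build is an assumption here and may differ from what the preview actually ships; check `ollama list` after installing.

```python
# Minimal sketch using the official `ollama` Python client (pip install ollama).
# The model tag below is a placeholder/assumption, not a confirmed tag from the
# 0.19 preview.
import ollama

MODEL = "qwen3.5:35b-a3b"  # hypothetical tag for the NVFP4-quantized model

ollama.pull(MODEL)  # download the model if it is not already present locally

response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize what NVFP4 quantization does."}],
)
print(response["message"]["content"])
```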
Beyond raw speed, the update introduces production-grade capabilities: NVFP4 support ensures local inference matches cloud deployment results while reducing memory bandwidth requirements. Ollama 0.19 also features intelligent caching improvements that reuse conversation caches across sessions, store snapshots at strategic prompt locations, and implement smarter eviction policies, all of which lower memory utilization and speed up responses for agentic tasks. Currently optimized for Alibaba's Qwen3.5-35B-A3B model quantized to NVFP4, the preview requires Macs with 32GB+ unified memory and represents the first step toward broader architecture support and custom model imports in future releases.
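To make the caching idea concrete, here is a conceptual sketch, not Ollama's actual implementation: a prefix cache that keeps KV-cache "snapshots" at prompt boundaries (for example, after a system prompt or a completed agent turn), reuses the longest matching prefix so only new tokens need to be prefilled, and evicts the least recently used entry when memory runs low. All names and the eviction policy below are illustrative assumptions.

```python
# Conceptual sketch only: prefix-cache reuse with snapshots and LRU eviction.
from collections import OrderedDict


class PrefixCache:
    def __init__(self, max_entries=8):
        # OrderedDict preserves use order, which gives us LRU eviction for free.
        self.snapshots = OrderedDict()
        self.max_entries = max_entries

    def longest_prefix(self, prompt):
        """Return the longest cached prefix of `prompt`, so only the remaining
        tokens need to be prefilled."""
        best = None
        for prefix in self.snapshots:
            if prompt.startswith(prefix) and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is not None:
            self.snapshots.move_to_end(best)  # mark as recently used
        return best

    def store(self, prompt, kv_state):
        """Snapshot the KV cache after processing `prompt` (e.g. after the
        system prompt or a finished agent turn)."""
        self.snapshots[prompt] = kv_state
        self.snapshots.move_to_end(prompt)
        if len(self.snapshots) > self.max_entries:
            self.snapshots.popitem(last=False)  # evict least recently used


# Usage: a coding agent reuses the system-prompt snapshot across sessions.
cache = PrefixCache()
cache.store("SYSTEM: You are a coding agent.\n", kv_state="<kv tensors>")
hit = cache.longest_prefix("SYSTEM: You are a coding agent.\nUSER: fix the bug")
print(hit)  # the cached system prompt; only the new user turn gets prefilled
```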
- Over 2x faster token generation (134 vs 58 tokens/sec) and 60% faster prefill (1851 vs 1154 tokens/sec) on Apple M5 chips
- Production parity: NVIDIA's NVFP4 quantization format maintains model accuracy while reducing memory bandwidth requirements
- Intelligent caching reuses conversation caches across sessions and stores prompt snapshots, reducing memory use for coding agents; the preview requires Macs with 32GB+ unified memory
Why It Matters
Developers can now run production-quality coding agents locally with cloud-level performance, reducing latency and dependency on external APIs.
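Since Ollama also serves an OpenAI-compatible API at http://localhost:11434/v1, any coding agent or tool that speaks that API can be pointed at the local server instead of a cloud endpoint. The sketch below uses the `openai` Python client; the model tag is a placeholder assumption, not a confirmed tag from the preview.

```python
# Point any OpenAI-compatible client at the local Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="qwen3.5:35b-a3b",  # hypothetical tag for the NVFP4 model named above
    messages=[{"role": "user", "content": "Write a unit test for parse_config()."}],
)
print(resp.choices[0].message.content)
```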