v0.18.2-rc0
The latest release candidate introduces a model eviction scheduler for Apple Silicon and faster inference for specific models.
Ollama, the open-source platform for running large language models locally, has published a new release candidate, version 0.18.2-rc0. This pre-release focuses on performance and memory management, particularly for users on Apple Silicon hardware. The most significant change is a model eviction scheduler for MLX, Apple's machine-learning framework for its own silicon. The scheduler lets Ollama manage system memory by unloading inactive models, a crucial capability for running multiple or larger models on devices with limited RAM.
Other technical improvements include prequantized tensor packing for the Qwen 3.5 model family, which should speed up loading and inference, along with fixes for quantized embeddings and the SwiGLU activation function in the MLX backend. For users of the web search feature, the release flushes output on newlines in the legacy path and registers the feature for the 'openclaw' model. These are incremental but meaningful optimizations that refine the experience of running state-of-the-art open-source models like Qwen 3.5 on personal computers.
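For context on how model residency is controlled today, the sketch below shows Ollama's existing memory-related settings and commands (documented in the Ollama FAQ and CLI help); the new MLX eviction scheduler operates alongside this kind of configuration. The model name is illustrative, and nothing here is specific to this release candidate:

```shell
# How long a model stays loaded in memory after its last request
# (a duration string like "5m", or -1 to keep it loaded indefinitely).
export OLLAMA_KEEP_ALIVE=5m

# Upper bound on models loaded simultaneously; once the limit is hit,
# idle models are unloaded to make room.
export OLLAMA_MAX_LOADED_MODELS=2

# Inspect which models are currently resident in memory.
ollama ps

# Explicitly unload a model without waiting for the keep-alive timeout.
ollama stop qwen3   # "qwen3" is an illustrative model name
```

The eviction scheduler's benefit is that users on RAM-constrained Macs need to reach for these manual controls less often, since inactive MLX models can be unloaded automatically.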
- Adds a model eviction scheduler for Apple's MLX framework to free up memory on Apple Silicon Macs.
- Introduces prequantized tensor packing for the Qwen 3.5 model family to improve performance.
- Includes runtime fixes for quantized embeddings and the SwiGLU activation function within the MLX backend.
Why It Matters
Enables more efficient local AI on Apple hardware, letting users run larger models or switch between them without exhausting system memory.