Developer Tools

v0.18.4

The latest release patches a critical memory leak in the MLX backend and resolves flash attention issues with Grok models.

Deep Dive

Ollama, the open-source platform for running large language models locally, has published version 0.18.4 as a pre-release candidate. This maintenance update focuses on backend stability and compatibility fixes rather than new features. The headline fix addresses a critical memory leak in the MLX backend (used on Apple Silicon), specifically in the Key-Value (KV) cache snapshot system used during model inference, where memory accumulated over prolonged sessions instead of being released.
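The release notes don't include the patch itself, but the failure mode they describe is a classic lifetime bug: a cache that internally retains every snapshot it hands out can never free them, so memory grows with session length. Below is a minimal Python sketch of that pattern and its fix; the KVCache class and its method names are hypothetical illustrations, not Ollama's actual MLX backend code.

    # Hypothetical illustration of the leak pattern, not Ollama source.
    class KVCache:
        def __init__(self) -> None:
            self.keys: list = []        # one entry per generated token
            self.values: list = []
            self._snapshots: list = []  # the leak: grows for the session's life

        def append(self, k, v) -> None:
            self.keys.append(k)
            self.values.append(v)

        def snapshot_leaky(self):
            # Copies the cache AND retains the copy internally, so it can
            # never be garbage-collected even after the caller drops it.
            snap = (list(self.keys), list(self.values))
            self._snapshots.append(snap)
            return snap

        def snapshot_fixed(self):
            # Hands the copy to the caller without retaining it; the
            # snapshot is freed as soon as the caller is done with it.
            return (list(self.keys), list(self.values))

    cache = KVCache()
    for step in range(1000):
        cache.append(f"k{step}", f"v{step}")
        cache.snapshot_leaky()
    print(len(cache._snapshots))  # 1000 pinned copies nobody can free

Whatever form the actual fix takes, the invariant is the same: per-inference snapshots must have a bounded lifetime, or memory scales with how long the model session runs.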

Another significant change disables flash attention unconditionally for Grok models, resolving compatibility issues that could cause crashes or incorrect outputs. The update also includes minor documentation improvements for the VS Code extension and a UI tweak that hides VS Code references in certain contexts. These patches matter for developers who rely on Ollama for stable local inference of models such as xAI's Grok or Meta's Llama 3, particularly on macOS, where the MLX backend is crucial for performance.
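Ollama already treats flash attention as an opt-in server setting (the OLLAMA_FLASH_ATTENTION environment variable), so the change described here amounts to a per-family override of that setting. A minimal sketch of such a guard follows, assuming a hypothetical use_flash_attention helper and family-name check; this is not Ollama's actual source, just the shape of the logic.

    # Hypothetical guard, not Ollama source: force flash attention off for
    # model families known to crash or miscompute with it.
    FLASH_ATTENTION_DENYLIST = {"grok"}

    def use_flash_attention(model_family: str, user_requested: bool) -> bool:
        """Decide whether flash attention is actually enabled for a model."""
        if model_family.lower() in FLASH_ATTENTION_DENYLIST:
            return False  # override wins: known crashes / incorrect outputs
        return user_requested

    assert use_flash_attention("grok", True) is False   # forced off
    assert use_flash_attention("llama", True) is True   # user setting respected

A denylist like this is the conservative choice: affected models get a safe default immediately, and the entry can be removed once the underlying attention path is fixed.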

Key Points
  • Patches a memory leak in the MLX backend's KV cache snapshot system, improving stability on Apple Silicon
  • Forces flash attention off for Grok models to resolve compatibility and crashing issues
  • Includes minor updates to VS Code extension documentation and launch configuration

Why It Matters

Ensures stable, memory-efficient local AI inference for developers, especially those using Grok or running models on Apple hardware.