Developer Tools

v0.22.0

A new MLX batch sampler speeds up inference across multiple concurrent sequences by 2x.

Deep Dive

Ollama, the popular open-source tool for running large language models locally, has released v0.22.0 as a pre-release. The update introduces an MLX batch sampler that batches the sampling step across multiple sequences, significantly speeding up inference when several queries are processed concurrently. This is particularly beneficial for users serving chatbots or handling many conversations in parallel.
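
Conceptually, batched sampling replaces a per-sequence loop with one vectorized draw over a (num_sequences, vocab_size) logits matrix. The sketch below illustrates the idea using MLX's Python API; it is not Ollama's implementation (Ollama's engine is written in Go), and the function name, shapes, and temperature handling are assumptions for the example.

```python
# A minimal sketch of vectorized sampling over a batch of sequences, using
# MLX's Python API. Illustrative only; the function name and shapes here
# are assumptions, not Ollama's code.
import mlx.core as mx

def batched_sample(logits: mx.array, temperature: float = 1.0) -> mx.array:
    """Draw one next token per sequence in a single vectorized call,
    rather than looping over sequences one at a time."""
    if temperature == 0.0:
        # Greedy decoding: pick the highest-probability token per row.
        return mx.argmax(logits, axis=-1)
    # mx.random.categorical samples from softmax(logits) along the last
    # axis, so scaling the logits applies temperature per sequence.
    return mx.random.categorical(logits * (1.0 / temperature))

# Example: 4 concurrent sequences over a 32,000-token vocabulary.
logits = mx.random.normal((4, 32_000))
next_tokens = batched_sample(logits, temperature=0.8)
print(next_tokens.shape)  # (4,)
```

Because the random draw is a single vectorized operation over the whole batch, per-token sampling overhead no longer grows with the number of concurrent sequences.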

Additionally, the release fixes a tokenizer bug in multi-regex BPE offset handling, ensuring token offsets map correctly back to the original input text. Most notably, Ollama now supports importing NVIDIA TensorRT Model Optimizer models, broadening the range of optimized checkpoints available for GPU-accelerated inference on NVIDIA hardware. These improvements make local LLM deployment faster and more versatile for developers and power users.
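
For intuition on the tokenizer fix: multi-regex pretokenization splits text with several patterns in sequence, and each match's offsets must be remapped from the intermediate fragment back to the original string. The sketch below is a hypothetical illustration of that bookkeeping, not Ollama's tokenizer code; `pretokenize_with_offsets` and its pattern are invented for the example.

```python
# A hypothetical illustration of offset bookkeeping during multi-regex
# pretokenization; not Ollama's tokenizer code.
import re

def pretokenize_with_offsets(text, patterns):
    """Split text with each pattern in turn, carrying absolute offsets so
    downstream BPE merges can report spans in the original input."""
    spans = [(text, 0)]  # (fragment, absolute start offset in `text`)
    for pat in patterns:
        next_spans = []
        for fragment, base in spans:
            for m in re.finditer(pat, fragment):
                # re.finditer reports offsets relative to `fragment`;
                # adding `base` keeps them anchored to the original
                # string, which is the kind of bookkeeping an offset
                # bug like this concerns.
                next_spans.append((m.group(), base + m.start()))
        spans = next_spans
    return [(tok, start, start + len(tok)) for tok, start in spans]

print(pretokenize_with_offsets("Hello, world!", [r"\w+|[^\w\s]+"]))
# [('Hello', 0, 5), (',', 5, 6), ('world', 7, 12), ('!', 12, 13)]
```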

Key Points
  • MLX batch sampler batches sampling across multiple sequences for faster concurrent inference
  • Fixed multi-regex BPE offset handling in tokenizer for accurate text processing
  • Added support for importing NVIDIA TensorRT Model Optimizer models

Why It Matters

Ollama v0.22.0 speeds up local LLM inference and expands GPU support, making self-hosted AI more practical.