Developer Tools

v0.22.1

MLX batch sampler delivers faster multi-sequence inference on Apple Silicon GPUs.

Deep Dive

Ollama, the popular open-source platform for running large language models locally, has released v0.22.1, a pre-release update focused on performance and compatibility. The headline change is the MLX batch sampler from contributor jessegross, which batches sampling across multiple sequences so they can be decoded in parallel, significantly improving inference throughput on Apple Silicon GPUs. In practice, this means lower latency when several requests or multi-turn sessions are served at once.
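
To see the effect, you can drive the server with several requests at once. The following is a minimal sketch using the official ollama Python client's AsyncClient; the model name "llama3.2" and the prompts are placeholders, and it assumes a local server with the model already pulled.

    import asyncio
    from ollama import AsyncClient

    PROMPTS = [
        "Summarize the theory of relativity in one sentence.",
        "Write a haiku about compilers.",
        "Explain what a B-tree is in two sentences.",
    ]

    async def ask(client: AsyncClient, prompt: str) -> str:
        # One chat request; "llama3.2" is a placeholder for any pulled model.
        response = await client.chat(
            model="llama3.2",
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    async def main() -> None:
        client = AsyncClient()  # defaults to http://localhost:11434
        # Fire all requests at once; with batched sampling the server can
        # decode these sequences in parallel rather than one at a time.
        answers = await asyncio.gather(*(ask(client, p) for p in PROMPTS))
        for prompt, answer in zip(PROMPTS, answers):
            print(f"Q: {prompt}\nA: {answer}\n")

    asyncio.run(main())

Nothing changes on the client side: how many sequences the server decodes concurrently is governed by its OLLAMA_NUM_PARALLEL setting, and the speedup comes from the server batching the sampling step across those sequences.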

Additional fixes address critical issues: the tokenizer now tracks offsets correctly across multi-regex BPE pretokenization, improving accuracy on edge-case inputs (a sketch of the failure mode follows below). The update also resolves a desktop app startup bug that terminated active Ollama sessions on launch. For NVIDIA GPU users, the release adds support for importing models optimized with TensorRT Model Optimizer, allowing them to run directly. Model-level batching support further improves performance, and several new model additions round out the release.
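
The BPE fix concerns exactly this kind of bookkeeping. The sketch below is illustrative only, not Ollama's implementation: when pretokenization applies more than one regex (here, a hypothetical special-token pattern followed by a hypothetical main pattern), offsets from the second pass are relative to an intermediate slice and must be shifted back into original-string coordinates, or downstream token spans silently drift.

    import re

    FIRST_PASS = re.compile(r"<\|[a-z_]+\|>")      # hypothetical special-token pattern
    SECOND_PASS = re.compile(r"\w+|\s+|[^\w\s]+")  # hypothetical main pattern

    def pretokenize(text: str) -> list[tuple[str, int, int]]:
        """Return (piece, start, end) spans in original-string coordinates."""
        spans = []
        cursor = 0
        for m in FIRST_PASS.finditer(text):
            # The second regex runs on a slice, so its offsets are local to
            # that slice and must be shifted by the slice's start (cursor).
            for sm in SECOND_PASS.finditer(text[cursor:m.start()]):
                spans.append((sm.group(), cursor + sm.start(), cursor + sm.end()))
            spans.append((m.group(), m.start(), m.end()))
            cursor = m.end()
        for sm in SECOND_PASS.finditer(text[cursor:]):
            spans.append((sm.group(), cursor + sm.start(), cursor + sm.end()))
        return spans

    text = "hello <|endoftext|> world!"
    for piece, start, end in pretokenize(text):
        # Round-trip invariant: each span must map back to the exact piece.
        assert text[start:end] == piece
    print(pretokenize(text))

The asserts check the round-trip invariant that correct offset handling must preserve, which is the property this class of fix is about.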

Key Points
  • MLX batch sampler now processes multiple sequences in parallel, boosting inference speed on Apple Silicon
  • Fixed multi-regex BPE tokenizer offset handling for more accurate tokenization
  • Added NVIDIA TensorRT Model Optimizer import for better GPU compatibility

Why It Matters

Faster local inference and broader GPU support make it more practical for developers to run models on their own hardware.