PSA: llama-swap released a new grouping feature, matrix, letting you fine-tune which models can run together
Define any combination of LLMs, STT, or RAG models to run together efficiently.
llama-swap, the popular model-serving proxy, now offers matrix – a declarative, solver-based replacement for its static grouping system. Instead of assigning each model to a single group, you define expressive sets using a DSL with operators like & (AND), | (OR), and +ref (to inline another set). For example, "(g | q | m) & v" yields three allowed concurrent combinations: gemma+voxtral, qwen+voxtral, or mistral+voxtral. When a request arrives for a model that isn't loaded, the solver determines the cheapest eviction path by comparing the evict_costs assigned to each model (e.g., v:50 for a vllm backend with a slow cold start, L:30 for a 70B model). This keeps expensive-to-reload models resident unless eviction is absolutely necessary.
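Putting those three pieces together, a config might look roughly like the sketch below. This is a sketch based only on the elements described here (vars, evict_costs, sets); the model names for L and the exact key names and schema are assumptions, so check the llama-swap documentation for the real syntax.

```yaml
# Sketch only – key names taken from the description above, not a verified schema.
vars:
  g: gemma          # short aliases for models
  q: qwen
  m: mistral
  v: voxtral        # STT model on a vllm backend
  L: llama-70b      # hypothetical large model for illustration

evict_costs:
  v: 50             # slow cold start: avoid unloading if possible
  L: 30             # 70B weights are expensive to reload

sets:
  chat_with_stt: "(g | q | m) & v"   # any one LLM may run alongside voxtral
```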
Practical impact: you can run an LLM alongside a TTS model without losing the TTS when switching LLMs, or keep a reranker loaded while swapping between smaller LLMs. Models not listed in any set run alone by default. This replaces the legacy all-or-nothing groups with fine-grained concurrency control for multi-model pipelines. The config uses vars for short model aliases, evict_costs for relative unload penalties, and sets for named combinations – all in a single YAML file. It's a powerful upgrade for anyone juggling multiple AI models on limited GPU memory.
- Matrix uses a DSL with &, |, +ref operators for flexible concurrent model sets.
- evict_costs assigns a relative unload cost to each model (e.g., v:50 for a slow vllm cold start), so the solver prioritizes keeping expensive models loaded.
- Models not in any set run alone; solver finds cheapest eviction path to minimize reload overhead.
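The "cheapest eviction path" idea from the bullets above can be illustrated with a toy calculation (not llama-swap's actual solver): among the candidate sets that satisfy the incoming request, pick the one whose required evictions have the lowest total evict_cost. Model names and the default cost of 1 here are illustrative assumptions.

```python
# Toy illustration of cheapest-eviction selection, NOT llama-swap's real solver.
def cheapest_eviction(loaded, candidate_sets, evict_costs):
    """Pick the candidate set that minimizes the total evict_cost of the
    currently loaded models that would have to be unloaded."""
    def cost(candidate):
        # Models loaded now but absent from the candidate must be evicted.
        return sum(evict_costs.get(m, 1) for m in loaded - candidate)
    return min(candidate_sets, key=cost)

loaded = {"gemma", "voxtral"}
# A request for qwen could be satisfied by qwen+voxtral or qwen alone:
candidates = [frozenset({"qwen", "voxtral"}), frozenset({"qwen"})]
costs = {"voxtral": 50, "gemma": 5}

best = cheapest_eviction(loaded, candidates, costs)
# Evicting only gemma (cost 5) beats evicting gemma+voxtral (cost 55),
# so the set that keeps voxtral loaded wins.
print(sorted(best))  # → ['qwen', 'voxtral']
```

This is why the post stresses assigning high costs to slow-cold-start backends: the solver naturally routes around unloading them.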
Why It Matters
Fine-grained concurrency control means fewer unnecessary model loads, saving GPU time and latency for multi-model workflows.