The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
An open-source tool now lets LLMs automatically optimize their own inference settings, boosting Qwen3.5-27B throughput by 54%.
Developer Raketenkater has released version 2 of the open-source llm-server, introducing a novel `--ai-tune` feature that lets a large language model autonomously optimize its own inference parameters for the llama.cpp backend. Instead of users manually tweaking dozens of complex flags, the system runs the LLM in a loop, testing different configurations and caching the fastest one it discovers. In benchmark tests on a multi-GPU rig, this AI-driven tuning delivered dramatic speedups: Qwen3.5-27B (Q4_K_M) jumped from 25.94 to 40.05 tokens per second, a 54% increase, while the much larger Qwen3.5-122B more than quadrupled its throughput from a baseline of 4.1 tokens per second.
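The mechanics are easy to picture. Below is a minimal sketch of such a tuning loop, assuming only that `llama-server` is on PATH and exposes its standard `/completion` HTTP endpoint; the cache path, prompt, and load-wait are illustrative, and this is not llm-server's actual implementation:

```python
import json
import subprocess
import time
import urllib.request
from pathlib import Path

CACHE = Path.home() / ".cache" / "llm-tune-best.json"  # hypothetical cache location

def benchmark(model: str, flags: list[str], port: int = 8080) -> float:
    """Launch llama-server with one candidate flag set, time a fixed
    generation over its HTTP API, and return tokens per second."""
    proc = subprocess.Popen(
        ["llama-server", "-m", model, "--port", str(port), *flags])
    try:
        time.sleep(20)  # crude wait for model load; a real tuner would poll /health
        payload = json.dumps(
            {"prompt": "Write a short story.", "n_predict": 256}).encode()
        req = urllib.request.Request(
            f"http://127.0.0.1:{port}/completion", payload,
            {"Content-Type": "application/json"})
        t0 = time.time()
        with urllib.request.urlopen(req, timeout=300) as resp:
            body = json.load(resp)
        # The server reports how many tokens it actually generated.
        return body.get("tokens_predicted", 256) / (time.time() - t0)
    except Exception:
        return 0.0  # invalid combos (bad flag, OOM) simply score zero
    finally:
        proc.terminate()
        proc.wait()

def tune(model: str, candidates: list[list[str]]) -> list[str]:
    """Benchmark every LLM-proposed flag set and cache the fastest."""
    best, best_tps = [], 0.0
    for flags in candidates:
        tps = benchmark(model, flags)
        print(f"{tps:7.2f} tok/s  {' '.join(flags) or '(defaults)'}")
        if tps > best_tps:
            best, best_tps = flags, tps
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps({"flags": best, "tok_s": best_tps}))
    return best
```

Scoring failed launches as zero is the simple design choice here: a flag combination that crashes or runs out of memory is just another slow candidate, so the loop never needs to special-case errors.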
A key advantage of this approach is its future-proofing. The tuning loop ingests the output of `llama-server --help` as context, so the AI can immediately understand and test new performance flags as they land in upstream projects like llama.cpp or ik_llama.cpp, with no manual intervention required. The update also brings improved stability and a new terminal user interface (TUI) via `llm-server-gui`. For developers and researchers running local models, this tool significantly lowers the barrier to peak hardware performance.
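Feeding the tuner the live help text is the key trick, and it is straightforward to sketch. The `propose_flags` helper below is hypothetical (the prompt wording and the OpenAI-compatible client are assumptions, not llm-server's actual code), but it shows the shape of the idea:

```python
import subprocess
from openai import OpenAI  # any OpenAI-compatible client, e.g. pointed at a local server

def propose_flags(client: OpenAI, n: int = 5) -> list[list[str]]:
    """Ground the tuning LLM in the current `llama-server --help` output
    so flags added upstream are visible on the very next run."""
    help_text = subprocess.run(
        ["llama-server", "--help"], capture_output=True, text=True).stdout
    prompt = (
        "You tune llama.cpp flags for generation throughput.\n"
        f"Available options:\n{help_text}\n\n"
        f"Propose {n} promising flag combinations, one per line, flags only.")
    reply = client.chat.completions.create(
        model="local-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return [line.split() for line in reply.splitlines() if line.strip().startswith("-")]
```

Since the help text is re-read on every run, the candidate pool automatically tracks upstream changes.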
- The new `--ai-tune` flag enables LLMs to self-optimize llama.cpp inference settings, boosting Qwen3.5-27B performance by 54% to 40.05 tokens/sec.
- The tuner is future-proof: it reads `llama-server --help` as context, allowing it to utilize new performance flags as soon as they ship upstream.
- Version 2 also includes enhanced stability and a new terminal user interface (TUI) for easier management of local model inference.
Why It Matters
This automates the complex tuning required for peak local LLM performance, making high-speed inference more accessible to developers and researchers.