PSA: If you want to test new models, use llama.cpp/transformers/vLLM/SGLang
Viral PSA reveals Ollama and LMStudio cause garbage responses in new Qwen 2.5 models.
A viral PSA from the developer community is calling out popular local AI runtimes for corrupting the performance of cutting-edge models. The post clarifies that widespread reports of garbage outputs, broken tool calls, and failed chain-of-thought reasoning from the new Qwen 2.5 models are not the models' fault, but artifacts of running them on suboptimal inference servers such as Ollama and LMStudio. These frameworks, often built on outdated versions of llama.cpp, fail to properly implement model-specific features and required sampling parameters, leading developers to incorrectly blame model quality.
Specifically, LMStudio incorrectly tries to parse a model's internal <thinking> tags as tool calls and lacks support for the 'presence penalty' parameter required by newer Qwen releases. Ollama is criticized for broader performance and reliability issues. The PSA urges developers and researchers to test models using robust, up-to-date inference engines like the official llama.cpp server, Hugging Face's Transformers, vLLM, or SGLang to get accurate, uncorrupted results. This highlights a critical gap in the local AI toolchain where convenience-focused wrappers are lagging behind the rapid pace of model development.
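To make the sampling-parameter point concrete, here is a minimal sketch of how a developer might pass `presence_penalty` through an OpenAI-compatible chat endpoint, such as the one exposed by llama.cpp's server or vLLM. The model name and the specific sampling values below are illustrative assumptions, not from the PSA; the correct values for any given Qwen release should be taken from its model card.

```python
import json

def build_chat_request(prompt: str, presence_penalty: float = 1.5) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload.

    Servers like llama.cpp's llama-server and vLLM accept this schema;
    runtimes that drop presence_penalty silently lose the setting.
    """
    return {
        # Hypothetical model name; must match whatever the local server loaded.
        "model": "qwen2.5-instruct",
        "messages": [{"role": "user", "content": prompt}],
        # Illustrative sampling values; consult the model card for the
        # release you are actually testing.
        "temperature": 0.7,
        "top_p": 0.8,
        # The parameter the PSA says LMStudio does not support.
        "presence_penalty": presence_penalty,
    }

payload = build_chat_request("Summarize the benefits of up-to-date inference engines.")
print(json.dumps(payload, indent=2))
```

Because the payload is plain JSON, it is easy to diff what a wrapper actually sends against what the model card recommends, which is exactly the kind of check the PSA argues developers should make before blaming a model.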
- Ollama and LMStudio cause garbage responses & broken tool calls in new Qwen 2.5 models.
- LMStudio fails to support 'presence penalty' and incorrectly parses model <thinking> tags.
- For accurate testing, use llama.cpp, Hugging Face Transformers, vLLM, or SGLang instead.
Why It Matters
Developers risk misjudging model capabilities and wasting time debugging failures caused by their runtime rather than the model itself.