PSA: Qwen 3.6 ships with `preserve_thinking`. Make sure you have it on.
The new flag solves a cache invalidation issue, boosting agent performance and cutting token use.
Alibaba's latest Qwen 3.6 language model ships with a fix for a persistent technical bug: a new `preserve_thinking` configuration flag. It addresses a KV (key-value) cache invalidation issue present in Qwen 3.5, where the model's internal reasoning was stripped and re-serialized between conversational turns, corrupting the cache and forcing redundant computation. To apply the fix, users set `"preserve_thinking": True` in their chat template arguments, replacing the previous workaround.
In practice, the model can now retain access to its entire prior reasoning chain. This is particularly valuable for agentic workflows, where an AI uses tools or makes multi-step decisions: the model can reference its own previous logic instead of starting from scratch each turn, leading to more consistent decisions and, counterintuitively, often lower overall token usage, since repetitive re-reasoning is eliminated. Early testing suggests a simple validation: ask the model to generate two random 20-digit numbers and reveal only one; a follow-up request for the second number fails without the flag but succeeds with it, confirming that the reasoning context is preserved across turns.
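To see why stripping the reasoning breaks the KV cache, consider how a conversation is re-serialized each turn. The sketch below is a toy simulation, not Qwen's actual chat template: the `render` function and the `<|role|>` markers are illustrative. The point is that dropping the `<think>` block changes the prompt prefix, so the tokens already in the cache no longer match.

```python
# Toy simulation of chat-template serialization (illustrative, not Qwen's
# real template). With preserve_thinking off, assistant reasoning is
# stripped on re-serialization, changing the prompt prefix and
# invalidating any cached KV entries for those tokens.

def render(history, preserve_thinking):
    """Serialize a conversation into a single prompt string."""
    parts = []
    for msg in history:
        if msg["role"] == "assistant" and not preserve_thinking:
            # Qwen 3.5 behavior: drop the <think>...</think> block.
            text = msg["content"].split("</think>")[-1].lstrip()
        else:
            text = msg["content"]
        parts.append(f"<|{msg['role']}|>{text}")
    return "".join(parts)

history = [
    {"role": "user", "content": "Pick two numbers and tell me one."},
    {"role": "assistant",
     "content": "<think>I chose 41 and 7.</think>The first is 41."},
]

turn1_preserved = render(history, preserve_thinking=True)
turn1_stripped = render(history, preserve_thinking=False)

next_turn = history + [{"role": "user", "content": "What was the second?"}]

# With the flag on, turn 2's prompt extends turn 1's exactly, so the
# cached prefix is reusable and the reasoning ("7") is still in context.
assert render(next_turn, preserve_thinking=True).startswith(turn1_preserved)

# With it off, the reasoning is gone: the prefix no longer matches the
# cache, and the second number is unrecoverable.
assert not render(next_turn, preserve_thinking=False).startswith(turn1_preserved)
assert "I chose 41 and 7." not in turn1_stripped
```

This is also why the two-random-numbers test above works as a probe: the unrevealed number lives only in the stripped reasoning block.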
Adoption is underway, with support being added to inference servers: LM Studio does not yet support the flag, but an open pull request exists for oMLX. The update is a behind-the-scenes engineering improvement that translates directly into more reliable and efficient AI applications, especially for developers building complex, stateful AI agents.
- Fixes a KV cache invalidation bug from Qwen 3.5 by preserving the model's internal reasoning context across turns.
- Enhances agent/tool-calling workflows, improving decision consistency and potentially reducing token consumption by up to 20-30% in redundant scenarios.
- Requires a config change: users must set `"preserve_thinking": True` in chat template arguments instead of the old `False` default.
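A minimal sketch of the config change in the last bullet, assuming a Hugging Face transformers-style setup where extra keyword arguments are forwarded into the chat template; the commented `apply_chat_template` call is illustrative, and the exact wiring may differ per inference stack:

```python
# Hedged sketch: pass the flag as an extra chat-template argument.
# Qwen 3.5 effectively behaved as if this were False.
chat_template_kwargs = {"preserve_thinking": True}

# Illustrative transformers-style usage (names per your stack's docs):
# prompt = tokenizer.apply_chat_template(
#     messages,
#     add_generation_prompt=True,
#     **chat_template_kwargs,
# )
```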
Why It Matters
Developers building AI agents get more reliable, consistent, and cost-effective reasoning, fixing a major hidden performance bug.