Open Source

You can use Qwen3.5 without thinking

A simple command-line flag disables the model's internal reasoning, dramatically speeding up responses.

Deep Dive

A community discovery has revealed a major performance optimization for running Alibaba's Qwen3.5 language model locally. Users found that by passing a specific flag (`--chat-template-kwargs '{"enable_thinking": false}'`) to the popular llama.cpp inference server, they can disable the model's built-in 'Chain-of-Thought' reasoning feature. This internal 'thinking' process, while beneficial for complex reasoning, adds substantial latency to every response. The hack effectively tells the model to skip its deliberative step and generate answers directly, which is ideal for simpler, instruction-following tasks. Early adopters report responses arriving nearly twice as fast, making local deployment much more practical for real-time applications.
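In practice, the flag is passed when launching llama.cpp's `llama-server`. A minimal sketch of such an invocation (the model filename and port here are placeholders, not from the report):

```shell
# Launch llama-server with Qwen3.5's chain-of-thought step disabled.
# Model filename and port are placeholders; adjust to your setup.
llama-server \
  -m Qwen3.5-Instruct-Q4_K_M.gguf \
  --port 8080 \
  --chat-template-kwargs '{"enable_thinking": false}'
```

Note that the JSON value must be wrapped in single quotes so the shell passes it through to the server intact.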

The optimization requires pairing the flag with a set of recommended generation parameters for 'instruct mode,' including a temperature of 0.7, top-p of 0.8, and adjusted penalty settings. Crucially, testers like Reddit user guiopen report that this speed boost comes without the significant quality degradation typically seen when disabling similar features in other models like GLM Flash. This suggests Qwen3.5's base model is robust enough for direct generation. The finding highlights the growing importance of fine-grained runtime control in open-source AI and gives users a template for trading reasoning depth for raw speed based on their specific use case.

Key Points
  • Add `--chat-template-kwargs '{"enable_thinking": false}'` to llama.cpp to disable Qwen3.5's internal reasoning.
  • Use recommended instruct-mode parameters: `--temp 0.7 --top-p 0.8 --repeat-penalty 1.0`.
  • Users report a major speed increase with no noticeable drop in quality for instruction-following tasks.
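Because the flag's value is raw JSON, shell-quoting mistakes are easy to make. The launch command from the points above can instead be assembled programmatically; a minimal Python sketch (the model filename is a placeholder, and the flag names are those reported above):

```python
import json
import shlex

# Build the JSON payload for --chat-template-kwargs.
# "enable_thinking": false disables Qwen3.5's internal reasoning step.
kwargs = json.dumps({"enable_thinking": False})

# Assemble the full llama-server argv with the recommended
# instruct-mode sampling parameters. Model filename is a placeholder.
cmd = [
    "llama-server",
    "-m", "Qwen3.5-Instruct-Q4_K_M.gguf",
    "--chat-template-kwargs", kwargs,
    "--temp", "0.7",
    "--top-p", "0.8",
    "--repeat-penalty", "1.0",
]

# shlex.join quotes the JSON argument safely for copy-pasting into a shell.
print(shlex.join(cmd))
```

Using `json.dumps` guarantees the argument is valid JSON (e.g. Python's `False` becomes JSON's `false`), which the chat template expects.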

Why It Matters

Dramatically improves the speed and practicality of running powerful open-source models like Qwen3.5 on local hardware.