I haven't experienced Qwen3.5 (35B and 27B) overthinking. Posting my settings/prompt
A user's vanilla setup with Qwen 3.5-35B and 27B shows minimal reasoning-token usage, contradicting widespread reports of runaway reasoning loops.
A detailed user report is challenging the viral narrative that the Qwen 3.5 series models, specifically the 35B and 27B versions, suffer from 'overthinking'—getting caught in extended reasoning loops and consuming excessive tokens. The user, running quantized versions (UD-Q4_K_XL) from Unsloth on an RTX 5090 with llama.cpp, reports the opposite: impressively few tokens used for high-quality responses. Their key finding is that using 100% default inference parameters, with no custom sampling settings sent alongside prompts, yields stable and efficient performance. They posit that the widespread reports of problematic behavior may stem from users applying non-recommended parameters or overloading the models with dozens of tool definitions for complex agentic workflows.
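The "no custom settings" approach is easy to picture against llama-server's OpenAI-compatible chat endpoint. The sketch below is illustrative, not the user's actual client code: the model name and message contents are assumptions, and the point is what the request deliberately omits—no temperature, top_p, or other sampling overrides, so the server's defaults apply.

```python
import json

# Minimal chat-completions payload mirroring the reported approach:
# only model and messages, with NO sampling parameters attached, so
# llama-server falls back to its own defaults. Model name and message
# contents are hypothetical placeholders.
payload = {
    "model": "qwen3.5-35b",
    "messages": [
        {"role": "system", "content": "Think step by step and prioritize accuracy."},
        {"role": "user", "content": "Summarize the trade-offs of 4-bit quantization."},
    ],
}

# Note what is deliberately absent: any custom inference parameters.
assert "temperature" not in payload and "top_p" not in payload

print(json.dumps(payload, indent=2))
```

In practice this payload would be POSTed to the server's /v1/chat/completions endpoint; the design choice being demonstrated is simply that leaving sampling keys out of the request is how "100% defaults" is achieved.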
The user's setup is notably simple. They run the models via a standard llama-server with basic arguments (--jinja -fa 1) and a 100k context window, exclusively for a chat-style interface with access to only four simple tools (two for web search, one for image manipulation, and one for server queries). They employ a straightforward system prompt that emphasizes step-by-step thinking and accuracy. This experience suggests that the perceived instability of Qwen 3.5 models might be a configuration or use-case issue rather than a fundamental flaw, highlighting the critical importance of sharing reproducible setup details when discussing LLM performance.
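Based on the flags quoted in the report (--jinja -fa 1) and the stated 100k context window, the launch command plausibly looks like the sketch below. The model filename is an assumption pieced together from the quant and model names mentioned; no sampling flags are passed, matching the all-defaults approach.

```shell
# Hypothetical reconstruction of the reported llama-server launch.
# -m        : local GGUF path (assumed filename; substitute your own)
# --jinja   : use the model's embedded Jinja chat template
# -fa 1     : flash attention, as quoted in the report
# -c 100000 : ~100k-token context window
# No --temp / --top-p etc., so sampling stays at server defaults.
llama-server \
  -m ./Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --jinja \
  -fa 1 \
  -c 100000
```

This is a configuration fragment, not a runnable demo—it requires the GGUF file on disk and a GPU with enough memory for the quant plus the 100k context.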
- User runs Qwen 3.5-35B-A3B and 27B with 100% default parameters, reporting no 'overthinking' or token waste.
- Suggests the issue may be caused by bad custom parameters or overloaded agent setups with many tools.
- Setup uses a simple 4-tool chat interface on an RTX 5090 with llama.cpp, contradicting widespread performance complaints.
Why It Matters
Highlights how heavily model performance depends on correct configuration, potentially saving developers time spent troubleshooting bugs that may not exist in the models themselves.