Open Source

Qwen3.6-27B nearly matches Claude as a local agent reasoning layer, with ~95% schema-valid plans

47 coding workflows tested: local Qwen matches Claude on plans but struggles with tool calls.

Deep Dive

Over 14 days, a developer replaced Claude with Alibaba’s Qwen3.6-27B as the reasoning layer of a multi-agent orchestrator, running the model on a single RTX 3090 (24GB VRAM, Q6_K quantization, ~22GB on-GPU). The model handled plan generation, memory extraction, and auto-review for 47 multi-step coding workflows across two real repos. After minimal prompt tweaks, about 95% of Qwen’s multi-step plans were schema-valid, comparable to Claude. Memory extraction (Mem0-style) worked flawlessly, and auto-review caught about 60% of the bugs Claude’s review would have flagged, all with no API costs.
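The ~22GB on-GPU figure is consistent with a quick estimate from the Q6_K block layout. The sketch below assumes a nominal 27 billion parameters, all stored as Q6_K. Real GGUF files usually mix quantization types across tensors, so treat the result as a rough check rather than the developer's measured footprint.

```python
# Back-of-the-envelope weight footprint for a Q6_K-quantized model.
# Assumption: 27e9 parameters, every tensor stored as Q6_K.

Q6_K_BLOCK_WEIGHTS = 256              # weights per Q6_K super-block
Q6_K_BLOCK_BYTES = 128 + 64 + 16 + 2  # low 4 bits, high 2 bits, int8 scales, fp16 scale


def q6k_weight_bytes(n_params: float) -> float:
    """Approximate bytes needed to hold n_params weights in Q6_K."""
    return n_params / Q6_K_BLOCK_WEIGHTS * Q6_K_BLOCK_BYTES


bits_per_weight = Q6_K_BLOCK_BYTES * 8 / Q6_K_BLOCK_WEIGHTS
weights_gb = q6k_weight_bytes(27e9) / 1e9  # decimal gigabytes

print(f"{bits_per_weight:.4f} bits/weight, about {weights_gb:.1f} GB of weights")
```

Under these assumptions the weights alone take roughly 22 GB, which leaves only a few gigabytes on a 24GB card for the KV cache and activations.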

However, Qwen faltered where precision mattered. Tool-call JSON outputs had a 12% format error rate (wrong field names, hallucinated signatures), versus ~0.5% for Claude. Long-context drift appeared past 14k tokens, with the model misremembering earlier decisions; the practical ceiling was ~12k tokens, which required aggressive summarize-and-reset. Cascade failures also occurred: when a sub-agent failed, Qwen sometimes generated downstream steps that assumed the failed step had succeeded. This produced three cascading hallucinations in 47 runs, none of them critical because plan gating caught them. The key takeaway: Qwen3.6-27B is a viable reasoning layer for local agents, but it demands structured-output enforcement, plan-approval gating, and explicit re-plan-on-failure logic. The tool-call format error rate (12% versus ~0.5%) is the remaining obstacle before local models can fully replace cloud reasoning.
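As an illustration of structured-output enforcement, here is a minimal sketch of a validator that rejects malformed tool calls before anything executes. The tool registry, the `tool`/`args` envelope, and the function name are hypothetical; the developer's actual schema is not described.

```python
import json
from typing import Optional

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOLS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw: str) -> tuple[Optional[dict], list[str]]:
    """Return (call, []) if raw is a well-formed tool call, else (None, errors)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return None, ["top-level value must be an object"]

    errors: list[str] = []
    extra = set(call) - {"tool", "args"}
    if extra:
        errors.append(f"unexpected top-level fields: {sorted(extra)}")

    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return None, errors + [f"unknown tool: {name!r}"]
    args = call.get("args")
    if not isinstance(args, dict):
        return None, errors + ["'args' must be an object"]

    spec = TOOLS[name]
    # Catch wrong field names and hallucinated signatures.
    for key in sorted(set(spec) - set(args)):
        errors.append(f"missing argument: {key}")
    for key in sorted(set(args) - set(spec)):
        errors.append(f"unknown argument: {key}")
    for key, expected in spec.items():
        if key in args and type(args[key]) is not expected:
            errors.append(f"argument {key} must be {expected.__name__}")

    return (call, []) if not errors else (None, errors)
```

A rejected call can be returned to the model together with the error list for a bounded number of retries. Grammar-constrained decoding, where the runtime supports it, is another way to close the same gap.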

Key Points
  • Qwen3.6-27B produced schema-valid plans about 95% of the time, comparable to Claude on multi-step coding workflows.
  • Tool-call JSON outputs had a 12% format error rate versus ~0.5% for Claude, requiring strict enforcement at the execution boundary.
  • Long-context drift appeared past 14k tokens; the practical ceiling was ~12k tokens, managed with aggressive summarize-and-reset.
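The summarize-and-reset approach behind the last point could look roughly like the sketch below. The 12k budget comes from the test. The token estimator, the number of recent turns kept, and the `summarize` callback (in practice a call to the local model) are assumptions made for illustration.

```python
from typing import Callable, List

TOKEN_BUDGET = 12_000  # practical ceiling reported in the test
KEEP_RECENT = 4        # hypothetical: recent turns kept verbatim after a reset


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def compact_history(
    history: List[str],
    summarize: Callable[[List[str]], str],
    budget: int = TOKEN_BUDGET,
    keep_recent: int = KEEP_RECENT,
) -> List[str]:
    """Return history unchanged if under budget, else a summary plus recent turns."""
    total = sum(estimate_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    summary = summarize(old)  # condense earlier decisions into one turn
    return [f"[summary of earlier decisions] {summary}"] + recent
```

Calling `compact_history` before every model request keeps the prompt below the point where drift was observed. The sketch does not re-check the budget after compaction, so very long recent turns would still need separate handling.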

Why It Matters

Local models are closing the gap for agent reasoning but still need guardrails to match cloud reliability in execution.
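One such guardrail, plan-approval gating combined with re-plan-on-failure, can be sketched as follows. All names here (`StepResult`, `run_plan`, and the `execute`, `replan`, and `approve` callbacks) are hypothetical and not taken from the developer's orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    ok: bool
    detail: str = ""


def run_plan(
    plan: List[str],
    execute: Callable[[str], StepResult],
    replan: Callable[[List[StepResult], List[str]], List[str]],
    approve: Callable[[List[str]], bool],
    max_replans: int = 2,
) -> List[StepResult]:
    """Run steps in order; on failure, stop and request a new, approved plan."""
    results: List[StepResult] = []
    remaining = list(plan)
    replans = 0
    if not approve(remaining):  # gate the initial plan
        return results
    while remaining:
        result = execute(remaining.pop(0))
        results.append(result)
        if result.ok:
            continue
        # Never let downstream steps run as if the failed step had succeeded.
        if replans == max_replans:
            break
        replans += 1
        remaining = replan(results, remaining)  # the model sees the real failure
        if not approve(remaining):  # gate every revised plan as well
            break
    return results
```

The design choice is that a failed step always interrupts execution and that the revised plan is built from actual results, so the model cannot continue on the assumption that a sub-agent succeeded.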