Qwen3.6-27B achieved ~95% plan generation accuracy, matching Claude on multi-step coding workflows?

Qwen3.6-27B achieved ~95% plan generation accuracy, matching Claude on multi-step coding workflows.

Tool-call JSON outputs had a 12% format error rate vs Claude's 0.5%, requiring strict enforcement at the execution boundary?

Tool-call JSON outputs had a 12% format error rate vs Claude's 0.5%, requiring strict enforcement at the execution boundary.

Long-context drift began past 14k tokens; effective limit was ~12k tokens before memory degradation?

Long-context drift began past 14k tokens; effective limit was ~12k tokens before memory degradation.

Open Source

Qwen3.6-27B nearly matches Claude as local agent reasoning layer with 95% plan accuracy

r/LocalLLaMA June 02, 2026

⚡47 coding workflows tested: local Qwen matches Claude on plans but struggles with tool calls.

Deep Dive

In a 14-day test, a developer swapped Claude for Alibaba’s Qwen3.6-27B as the reasoning layer in a multi-agent orchestrator running on a single RTX 3090 (24GB VRAM, Q6_K quantization, ~22GB on-GPU). The system handled plan generation, memory extraction, and auto-review for 47 multi-step coding workflows across two real repos. Qwen produced multi-step plans with ~95% schema-valid success after minimal prompt tweaks—comparable to Claude. Memory extraction (Mem0-style) worked flawlessly, and auto-review caught about 60% of the bugs Claude’s review would have flagged, all free of API costs.

However, Qwen faltered where precision mattered. Tool-call JSON outputs had a 12% format error rate—wrong field names, hallucinated signatures—versus Claude’s ~0.5%. Long-context drift appeared past 14k tokens, with the model misremembering earlier decisions; the practical ceiling was ~12k tokens, requiring aggressive summarize-and-reset. Cascade failures also occurred: when a sub-agent failed, Qwen sometimes generated downstream steps assuming success, leading to three cascading hallucinations in 47 runs (non-critical due to plan gating). The key takeaway: Qwen3.6-27B is a viable reasoning layer for local agents but demands structured-output enforcement, plan-approval gating, and explicit re-plan-on-failure logic. The 12% tool-call gap is the remaining obstacle before local models can fully replace cloud reasoning.

Key Points

Qwen3.6-27B achieved ~95% plan generation accuracy, matching Claude on multi-step coding workflows.
Tool-call JSON outputs had a 12% format error rate vs Claude's 0.5%, requiring strict enforcement at the execution boundary.
Long-context drift began past 14k tokens; effective limit was ~12k tokens before memory degradation.

Why It Matters

Local models are closing the gap for agent reasoning but still need guardrails to match cloud reliability in execution.

Read Original Article

Qwen3.6-27B nearly matches Claude as local agent reasoning layer with 95% plan accuracy

Why It Matters

Related Articles

🚀 Stay Ahead in AI