Open Source

Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard, nearly matching Claude Opus 4.6 (40%) with the right verification strategy

A tiny 3B-active-parameter model nearly matches Claude Opus 4.6's coding performance with a simple verification strategy.

Deep Dive

A breakthrough in AI coding agents demonstrates that sophisticated verification strategies can dramatically boost the performance of smaller, more efficient models. Qwen3.5-35B-A3B, a Mixture-of-Experts model from Alibaba's Qwen team with only 3 billion active parameters, achieved a 37.8% cumulative resolution rate on the notoriously difficult SWE-bench Verified Hard subset (45 tasks). That puts it remarkably close to Anthropic's flagship Claude Opus 4.6, which scores 40% despite being a vastly larger and more computationally expensive model. The key innovation wasn't the model architecture itself but a dead-simple agent-loop strategy called 'verify-on-edit', which forces the AI to test its code changes immediately after making them.
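The loop itself fits in a few lines of Python. The sketch below is a hypothetical reconstruction from the description above, not code from the actual harness: the run_agent and next_action names, the transcript format, and the exact nudge wording are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    args: dict

# Assumed wording; the source only says the prompt tells the model
# to write and run a test script after each edit.
VERIFY_NUDGE = ("You just edited a file. Before doing anything else, "
                "write a small test script that exercises the change "
                "and run it with bash.")

def run_agent(next_action: Callable[[list], ToolCall],
              tools: dict[str, Callable[..., str]],
              task: str,
              max_steps: int = 50) -> list:
    """Drive a tool-calling model, nudging it to verify after each edit.

    `next_action` stands in for a model call that maps the running
    transcript to the next tool invocation (a hypothetical interface).
    """
    transcript = [("user", task)]
    for _ in range(max_steps):
        call = next_action(transcript)
        if call.name == "submit":  # the model decides it is done
            break
        transcript.append(("tool", tools[call.name](**call.args)))
        if call.name == "file_edit":
            # The critical intervention: immediately after every edit,
            # inject a prompt forcing the model to test its change.
            transcript.append(("user", VERIFY_NUDGE))
    return transcript
```

The point of the design is that the verification pressure lives in the loop, not the model: any tool-calling model slots into next_action unchanged.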

The technical leap came from researcher Seungyoun Shin, who built a minimal agent harness with basic tools (file_read, file_edit, bash) and found that injecting a verification step after every file edit was the critical factor. This 'nudge', a prompt telling the model to write and run a test script, catapulted performance from a 22.2% baseline to 37.8%. Notably, more complex approaches such as Monte Carlo Tree Search (MCTS) performed worse, breaking the small model's reasoning flow. On the full 500-task SWE-bench Verified benchmark, the tuned agent achieved 67.0%, positioning it competitively with much larger systems. The research, shared openly on GitHub, underscores a major shift: for practical AI coding assistants, strategic agent design and rigorous self-verification may matter more than simply scaling model parameters, paving the way for more capable and cost-effective AI developers.
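For concreteness, the three tools the harness exposes could be as thin as the wrappers below. Only the names file_read, file_edit, and bash come from the source; the signatures, the exact-match edit semantics, and the timeout are illustrative assumptions.

```python
import subprocess
from pathlib import Path

def file_read(path: str) -> str:
    """Return the contents of a file in the repository."""
    return Path(path).read_text()

def file_edit(path: str, old: str, new: str) -> str:
    """Replace one exact snippet in a file (a common edit-tool design)."""
    text = Path(path).read_text()
    if old not in text:
        return f"error: snippet not found in {path}"
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"

def bash(command: str, timeout: int = 120) -> str:
    """Run a shell command and return its combined stdout/stderr."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr

TOOLS = {"file_read": file_read, "file_edit": file_edit, "bash": bash}
```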

Key Points
  • Qwen3.5-35B-A3B, a 3B-active-param MoE model, scored 37.8% on SWE-bench Verified Hard, nearly matching Claude Opus 4.6's 40%.
  • A 'verify-on-edit' agent strategy (test after every code change) boosted performance from 22.2% to 37.8%, while a more complex MCTS approach performed worse.
  • On the full 500-task SWE-bench Verified, the agent hit 67.0%, showing efficient verification can rival larger models' coding capabilities.

Why It Matters

Proves that smarter agent workflows, not just bigger models, are crucial for building efficient, cost-effective AI coding assistants.