Developer Tools

CRANE merges Instruct and Thinking models for up to a 19.5% boost in code agents

Training-free editing combines reasoning and discipline to beat both models individually.

Deep Dive

Code agents must reason over long codebases while adhering to strict tool-use protocols. In paired Instruct and Thinking checkpoints, these capabilities sit in different models: Instruct models are concise and tool-disciplined, while Thinking models plan and recover better but over-deliberate, which degrades agent performance. Researchers propose CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the difference (delta) between the Thinking and Instruct weights as a pool of candidate reasoning edits. CRANE filters that pool in three stages: magnitude thresholding denoises the delta, a Conservative Taylor Gate retains only edits that benefit both reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection suppresses format-critical update directions. The result merges the two checkpoints without additional training.
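The three filtering stages can be illustrated with a toy sketch on flat weight vectors. This is not the authors' implementation: the summary does not give CRANE's formulas, so the function names, the sign-based gate rule, the sigmoid weighting, and all numbers below are assumptions chosen only to show the general shape of such a pipeline.

```python
import math

def magnitude_threshold(delta, keep_fraction):
    """Denoise: zero all but the largest-magnitude entries (ties may keep a few more)."""
    k = max(1, int(round(keep_fraction * len(delta))))
    cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= cutoff else 0.0 for d in delta]

def conservative_taylor_gate(delta, grad_reasoning, grad_tool):
    """Assumed gate: keep an entry only if first-order Taylor estimates say it
    lowers the reasoning loss and does not raise the tool-use loss."""
    gated = []
    for d, g_r, g_t in zip(delta, grad_reasoning, grad_tool):
        helps_reasoning = g_r * d < 0.0   # estimated reasoning loss decreases
        keeps_tool_use = g_t * d <= 0.0   # estimated tool-use loss does not increase
        gated.append(d if helps_reasoning and keeps_tool_use else 0.0)
    return gated

def sigmoidal_projection(delta, directions, importances, midpoint, sharpness):
    """Assumed projection: remove the component of delta along each orthonormal
    format-critical direction, scaled by a sigmoid of that direction's importance."""
    out = list(delta)
    for u, s in zip(directions, importances):
        weight = 1.0 / (1.0 + math.exp(-sharpness * (s - midpoint)))
        coeff = sum(a * b for a, b in zip(out, u))
        out = [a - weight * coeff * b for a, b in zip(out, u)]
    return out

# Hypothetical toy checkpoints and gradients (not real model data).
instruct = [0.0, 0.0, 0.0, 0.0]
thinking = [1.0, -0.05, 0.5, -2.0]
delta = [t - i for t, i in zip(thinking, instruct)]

denoised = magnitude_threshold(delta, keep_fraction=0.75)
gated = conservative_taylor_gate(
    denoised,
    grad_reasoning=[-1.0, -1.0, -1.0, 1.0],
    grad_tool=[0.0, 0.0, 1.0, 0.0],
)
projected = sigmoidal_projection(
    gated,
    directions=[[1.0, 0.0, 0.0, 0.0]],  # one assumed format-critical direction
    importances=[10.0],
    midpoint=0.0,
    sharpness=1.0,
)
merged = [i + d for i, d in zip(instruct, projected)]
print(merged)
```

In this toy run the small second entry is dropped as noise, the third is rejected by the gate because it would raise the assumed tool-use loss, and the first is almost entirely suppressed because it lies along the assumed format-critical direction. Only the last edit reaches the merged weights.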

Results across three benchmarks show that CRANE outperforms both source models and alternative merging strategies. On Roo-Eval, CRANE achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B. On SWE-bench-Verified, it resolves up to 14 additional instances at both model scales (reaching 122/500 and 180/500). On Terminal-Bench v2, pass@1/pass@5 improve by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3% at the two model scales, respectively. These gains preserve the Instruct model's efficiency, which makes CRANE a practical way to improve code agent performance without expensive fine-tuning.
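To read the SWE-bench-Verified counts on the same scale as the other percentages, a quick conversion helps. This is a minimal sketch: `resolve_rate` is a hypothetical helper name, and the only inputs are the counts quoted above.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Return the resolved share of a benchmark as a percentage, rounded to one decimal."""
    return round(100 * resolved / total, 1)

# Counts quoted in the text for the two model scales.
rates = [resolve_rate(122, 500), resolve_rate(180, 500)]
print(rates)
```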

Key Points
  • CRANE is a training-free parameter-editing method that merges Instruct and Thinking checkpoints for code agents.
  • On Roo-Eval, it boosts pass@1 by 19.5% for Qwen3-30B-A3B (to 66.2%) and 8.7% for Qwen3-Next-80B-A3B (to 81.5%).
  • SWE-bench-Verified sees up to 14 additional resolved instances at both scales, reaching 180/500 on the larger model.

Why It Matters

Enables code agents to reason deeply without sacrificing tool discipline, improving performance on real-world software engineering tasks.