Local coding models have crossed the threshold for real work
Open-weight 27B model scores 38.2% on Terminal-Bench 2.0, rivaling GPT-5.1-Codex.
In a recent benchmark run, the open-weight Qwen 3.6-27B model scored 38.2% on Terminal-Bench 2.0 under default timeout constraints. The test, conducted by the Antigma team, used the same settings as the official leaderboard to ensure a fair comparison. The result is significant because it puts an open-weight model on par with late-2025 hosted frontier models such as GPT-5.1-Codex (36.9%) and Claude Opus 4.1 (38.0%).
The key insight is not the absolute score (current hosted SOTA is around 80%) but the timing: this is the first time an offline model has closed the gap to within 6–8 months of hosted performance, making it viable for regulated, air-gapped, or batch deployment scenarios. The team also noted that MoE (mixture-of-experts) models still need roughly an order-of-magnitude improvement in inference speed. The findings suggest that open-weight models are rapidly becoming practical for real-world coding tasks that require data sovereignty or low latency.
Key Points
- Qwen 3.6-27B scored 38.2% on Terminal-Bench 2.0, matching GPT-5.1-Codex (36.9%) and Claude Opus 4.1 (38.0%).
- This is the first time an open-weight model has closed the performance gap to within 6–8 months of hosted models.
- MoE models still need significant inference speed improvements to be viable for real-time use.
Why It Matters
Open-weight models are now viable for regulated, air-gapped, and batch coding deployments, reducing reliance on hosted APIs.
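To make the deployment point concrete, here is a minimal sketch of how such a model could be driven once it is served locally behind an OpenAI-compatible endpoint (for example with vLLM or Ollama). The server address, placeholder API key, and model tag `qwen-27b` are assumptions for illustration only; the benchmark report does not specify a serving stack.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local, OpenAI-compatible server
# (e.g., vLLM or Ollama). No traffic leaves the machine, which is the
# property that matters for regulated or air-gapped deployments.
# The base_url, api_key placeholder, and model tag below are assumptions
# for illustration, not details from the benchmark report.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local inference server
    api_key="not-needed-locally",         # most local servers ignore this value
)

def ask_local_coder(prompt: str) -> str:
    """Send a single coding task to the locally hosted model."""
    response = client.chat.completions.create(
        model="qwen-27b",  # hypothetical local model tag
        messages=[
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,  # keep outputs fairly deterministic for code
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Batch mode: iterate over queued tasks with no per-request API cost.
    tasks = ["Write a shell one-liner that counts lines in all .py files."]
    for task in tasks:
        print(ask_local_coder(task))
```

Because the client only targets localhost, the same loop works unchanged in an air-gapped environment or as an overnight batch job.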