Open Source

SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More

Claude Opus 4.6 edges out GPT-5.2 and GLM-5 by less than 3 percentage points in a competitive coding benchmark.

Deep Dive

The latest SWE-rebench leaderboard for February 2026 reveals a fiercely competitive landscape for AI coding models. The benchmark, which tests models on 57 fresh GitHub pull request tasks, requires them to read real issues, edit code, and pass the full test suite. Anthropic's Claude Opus 4.6 maintains its lead with a 65.3% resolved rate and a strong pass@5 rate of approximately 70%. However, the margin is razor-thin, with OpenAI's GPT-5.2-medium close behind at 64.4%, and GLM-5 and GPT-5.4-medium tied at 62.8%. Google's Gemini 3.1 Pro Preview (62.3%) and DeepSeek-V3.2 (60.9%) round out a top-six cluster separated by less than 5 percentage points.
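For readers unfamiliar with the pass@5 metric mentioned above: pass@k estimates the probability that at least one of k sampled attempts resolves a task. A common unbiased estimator (the function name and interface below are illustrative, not part of the leaderboard's published code) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per task
    c: number of those attempts that resolved the task
    k: budget of attempts scored (e.g. k=5 for pass@5)
    Returns the probability that at least one of k attempts,
    drawn without replacement from the n samples, succeeds.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, a model that resolves a task on 1 of 2 attempts gets `pass_at_k(2, 1, 1) == 0.5`, while a task it never resolves in 10 attempts gets `pass_at_k(10, 0, 5) == 0.0`.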

This tight grouping underscores the rapid pace of development, where incremental improvements in reasoning and long-context understanding are driving performance. Notably, open-weight and hybrid models are making significant strides. Alibaba's Qwen3.5-397B achieved a 59.9% resolved rate, while Step-3.5-Flash reached 59.6%, demonstrating that publicly available models are closing in on proprietary leaders. MiniMax's M2.5 model also stands out as a cost-efficient option with a competitive 54.6% score. The results indicate that the frontier of AI-assisted software engineering is no longer dominated by a single model but is a crowded field where architectural choices, scaling, and efficient context use are key differentiators for developers and enterprises choosing an AI coding assistant.

Key Points
  • Claude Opus 4.6 leads with a 65.3% resolved rate, but the top six models sit within a 4.4-percentage-point band.
  • Open-weight models like Qwen3.5-397B (59.9%) and Step-3.5-Flash (59.6%) are rapidly closing the gap on proprietary models.
  • The benchmark uses 57 real, recent GitHub PRs, testing a model's ability to understand issues and pass a full test suite.
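The evaluation protocol described in the last bullet can be sketched in a few lines. Everything here is a conceptual illustration, not the leaderboard's actual harness: the `Task` fields and the `run_test_suite` stub are hypothetical stand-ins for cloning the repo, applying the model's patch, and running the project's tests.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One benchmark item: a real GitHub issue plus a repo snapshot."""
    issue: str
    patch_passes: bool  # stub: would the model's patch pass the full suite?

def run_test_suite(task: Task) -> bool:
    # Hypothetical stand-in for applying the generated patch and
    # executing the repository's entire test suite.
    return task.patch_passes

def resolved_rate(tasks: list[Task]) -> float:
    # A task counts as "resolved" only if the full suite passes.
    return sum(run_test_suite(t) for t in tasks) / len(tasks)
```

With 57 tasks, a resolved rate of 65.3% corresponds to roughly 37 of the 57 PRs fixed end to end.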

Why It Matters

For developers, this signals more viable, cost-effective AI coding assistants and intense competition that will drive faster model improvements.