SWE-rebench leaderboard drops 110 fresh coding tasks, benchmarks GPT-5.5 and Opus 4.7
New complex Python tasks from real GitHub PRs test top models side by side.
The SWE-rebench leaderboard just dropped a major update: 110 fresh Python tasks from real GitHub PRs created in March, April, and part of May. Models read real issues, edit code, and must make the full test suite pass, which keeps the standard SWE-bench format. This batch is larger than usual so that models can be evaluated on a broader set of tasks. Coming next week: Gemini Flash 3.5, DeepSeek v4 Pro, Qwen3.5-397B-A17B, plus smaller models for local development. Expect more frequent updates over larger task batches; the team is also working on multilingual tasks and other features. Join the leaderboard channel on Discord to discuss models and share feedback.
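To make the pass criterion concrete, here is a minimal sketch of how a "resolved" score could be computed under the SWE-bench convention, where each task lists tests that must go from failing to passing (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS). The class, field names, and sample data are hypothetical, and the actual SWE-rebench harness may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Outcome of one task after the model's patch is applied (hypothetical shape)."""
    task_id: str
    fail_to_pass: list[str]   # tests the patch is expected to fix
    pass_to_pass: list[str]   # tests that must not regress
    passed_tests: set[str] = field(default_factory=set)


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every required test passed."""
    required = set(result.fail_to_pass) | set(result.pass_to_pass)
    return required <= result.passed_tests


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty batch."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only, not real leaderboard results.
results = [
    TaskResult("task-a", ["test_fix"], ["test_old"], {"test_fix", "test_old"}),
    TaskResult("task-b", ["test_fix"], ["test_old"], {"test_old"}),
]
print(resolved_rate(results))  # 0.5
```

Under this scheme a patch that fixes the target bug but breaks an existing test still counts as unresolved, which is why partial fixes do not earn credit.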
- 110 fresh Python tasks from real GitHub PRs (March–May 2026) added to SWE-rebench.
- Evaluated models include GPT-5.5, Opus 4.7, Cursor Composer 2.5, and Kimi K2.6.
- Upcoming models: Gemini Flash 3.5, DeepSeek v4 Pro, Qwen3.5-397B-A17B, plus smaller models for local development.
Why It Matters
Developers get a benchmark built from real GitHub tasks for comparing coding models side by side, which helps when choosing one for production use.