Open Source

SWE-rebench leaderboard drops 110 fresh coding tasks, benchmarks GPT-5.5 and Opus 4.7

r/LocalLLaMA May 28, 2026

⚡ New complex Python tasks from real GitHub PRs test top models side by side.

Deep Dive

The SWE-rebench leaderboard has received a major update: 110 fresh Python tasks drawn from real GitHub PRs created in March, April, and part of May 2026. Each task keeps the standard SWE-bench format: the model reads a real issue, edits the code, and must make the full test suite pass. This batch is larger than usual so that models can be evaluated on a broader set of tasks. Results for Gemini Flash 3.5, DeepSeek v4 Pro, Qwen3.5-397B-A17B, and smaller models suited to local development are due next week. The team expects to publish more frequent updates with larger task batches and is working on multilingual tasks and other features. Discussion and feedback are welcome in the leaderboard channel on Discord.
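To make the pass criterion concrete, here is a minimal Python sketch of how a resolved rate could be computed under the rule described above, where a task counts only if every test passes. The names `TaskResult`, `is_resolved`, and `resolved_rate` are hypothetical and chosen for illustration; this is not SWE-rebench's actual harness, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task attempt (hypothetical structure, for illustration)."""
    task_id: str
    test_outcomes: dict[str, bool]  # test name -> True if the test passed


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if it has tests and every one of them passed."""
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, in [0, 1]; 0.0 for an empty batch."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up example batch: one fully passing task, one with a failing test.
batch = [
    TaskResult("task-1", {"test_a": True, "test_b": True}),
    TaskResult("task-2", {"test_a": True, "test_b": False}),
]
print(f"{resolved_rate(batch):.2f}")  # prints 0.50
```

Treating a task with no recorded test outcomes as unresolved is a design choice in this sketch, so that a crashed run cannot be counted as a pass.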

Key Points

110 fresh Python tasks from real GitHub PRs (March–May 2026) added to SWE-rebench.
Evaluated models include GPT-5.5, Opus 4.7, Cursor Composer 2.5, and Kimi K2.6.
Upcoming models: Gemini Flash 3.5, DeepSeek v4 Pro, Qwen3.5-397B-A17B, plus local developer models.

Why It Matters

Developers get a benchmark built from real tasks for comparing AI coding models side by side, which helps when choosing one for production use.
