Open Source

Apex-Testing's real-world coding benchmark updates with 70 tasks across 8 categories

r/LocalLLaMA May 23, 2026

⚡Tired of cherry-picked benchmarks? This benchmark uses real private repos with real bugs.

Deep Dive

Apex-Testing has published a major update to its real-world agentic coding benchmark, which now covers 95% of recent models. The benchmark was created to cut through inflated claims and curated demos by dropping models into real private GitHub repositories: 65 to 70 actual codebases with genuine bugs and feature requests. Models must work out fixes and additions just as a developer would, across 70 tasks in 8 categories.

The updated metrics include average cost per task, average time to completion, scores per category and difficulty, and an ELO-based leaderboard for head-to-head comparisons. Some runs are still incomplete, however: Qwen3.7 Max has finished about 40 of 70 tasks, DeepSeek v4 Pro and Flash are partially done, and the local Qwen3.6 models have yet to be added. The maintainer is considering accepting donations or OpenRouter tokens to cover API costs for future updates, while local models that fit in VRAM will always be added. The goal is a transparent, reproducible way to see what actually works on real-world coding tasks versus what is just marketing.
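The summary does not say how the ELO-based leaderboard is computed. As a rough illustration only, here is a minimal sketch of the standard Elo update applied to per-task head-to-head results. The pairing scheme, the starting rating of 1500, the K-factor of 32, and the model names are all assumptions for the example, not details from Apex-Testing.

```python
from itertools import combinations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings; score_a is 1.0 win, 0.5 draw, 0.0 loss for A."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def leaderboard(task_scores: dict, start: float = 1500.0, k: float = 32.0):
    """Build a ranking from per-task scores (hypothetical input format).

    task_scores maps a model name to a list of scores, one per task.
    On each task, every pair of models "plays" one game: the higher
    score wins, equal scores count as a draw.
    """
    ratings = {model: start for model in task_scores}
    n_tasks = len(next(iter(task_scores.values())))
    for t in range(n_tasks):
        for a, b in combinations(sorted(task_scores), 2):
            sa, sb = task_scores[a][t], task_scores[b][t]
            outcome = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Placeholder model names and scores, purely illustrative.
    demo = {"model_a": [1, 1, 0], "model_b": [0, 1, 0]}
    for name, rating in leaderboard(demo):
        print(f"{name}: {rating:.1f}")
```

With this scheme, a model with an incomplete run (such as one that has finished only part of the 70 tasks) would need to be compared only on the tasks both models completed, which the sketch does not handle.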

Key Points

70 tasks across 8 categories, all from real private GitHub repos with actual bugs and feature requests
Metrics include average cost, time, scoring by difficulty, and an ELO-based leaderboard for model comparison
Runs for some models (Qwen3.7 Max, DeepSeek v4) are still incomplete; local Qwen3.6 models will be added soon

Why It Matters

Provides a much-needed reality check on model performance for developers tired of hype-driven benchmarks.
