Apex-Testing's real-world coding benchmark updates with 70 tasks across 8 categories
Tired of cherry-picked benchmarks? This one runs models against real private repos with real bugs.
Apex-Testing has published a major update to its real-world agentic coding benchmark, which now covers 95% of recent models. The benchmark was created to cut through inflated claims and curated demos by dropping models into real private GitHub repositories: 65 to 70 actual codebases with genuine bugs and feature requests. Models must work out fixes and additions just as a developer would, across 70 tasks in 8 categories.
The update reports average cost per task, average time to complete, scores broken down by category and difficulty, and an Elo-based leaderboard for head-to-head comparisons. Some runs are still incomplete: Qwen3.7 Max has finished about 40 of 70 tasks, DeepSeek v4 Pro and Flash are partially done, and the local Qwen3.6 models have yet to be added. The maintainer is considering accepting donations or OpenRouter tokens to cover API costs for future updates; local models that fit in VRAM will always be added. The goal is a transparent, reproducible way to see what actually works on real-world coding tasks versus what’s just marketing.
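For readers unfamiliar with Elo-style rankings, here is a minimal sketch of how per-task results could be turned into a head-to-head leaderboard. It uses the standard Elo update; the starting rating of 1000, the K-factor of 32, the rule that the higher task score "wins" a pairing, and the model and task names are all assumptions for illustration. The benchmark's actual rating formula and pairing scheme are not described in the announcement.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new (A, B) ratings. score_a is 1.0 win, 0.5 draw, 0.0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


def leaderboard(task_scores, start=1000.0, k=32.0):
    """Build a ranking from {task_id: {model_name: task_score}}.

    Assumption: on each task, every pair of models that both ran it
    plays one "match", and the higher task score wins.
    """
    ratings = {}
    for task_id in sorted(task_scores):
        results = task_scores[task_id]
        for a, b in combinations(sorted(results), 2):
            ra = ratings.setdefault(a, start)
            rb = ratings.setdefault(b, start)
            if results[a] > results[b]:
                outcome = 1.0
            elif results[a] < results[b]:
                outcome = 0.0
            else:
                outcome = 0.5
            ratings[a], ratings[b] = update_elo(ra, rb, outcome, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores for two placeholder models on three placeholder tasks.
demo_scores = {
    "task-01": {"model-a": 1.0, "model-b": 0.0},
    "task-02": {"model-a": 0.5, "model-b": 0.5},
    "task-03": {"model-a": 1.0, "model-b": 0.4},
}

for name, rating in leaderboard(demo_scores):
    print(f"{name}: {rating:.1f}")
```

Because pairings only happen on tasks both models have completed, a scheme like this can still rank models with partial runs, though their ratings rest on fewer comparisons.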
- 70 tasks across 8 categories, all from real private GitHub repos with actual bugs and feature requests
- Metrics include average cost per task, average completion time, scoring by difficulty, and an Elo-based leaderboard for model comparison
- Runs for some models (Qwen3.7 Max, DeepSeek v4) are still incomplete; local Qwen3.6 models to be added soon
Why It Matters
Provides a much-needed reality check on model performance for developers tired of hype-driven benchmarks.