I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked)
Solo developer's benchmark tests LLMs on actual codebases, revealing surprising cost-performance gaps.
Deep Dive
Developer hauhau901 built APEX Testing, a benchmark that evaluates coding LLMs on 65 real-world tasks spanning 8 categories such as React work and debugging. Each model runs against a fresh clone of the target codebase, its output is graded by multiple SOTA models plus human review, and the results are ranked with Elo ratings. The project surfaces unexpected results, such as GPT-5.1 Codex Mini outperforming newer versions, and highlights large cost differences between similarly scored models.
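The post doesn't show the scoring code, but the Elo step works like a chess-style rating update applied to pairwise task verdicts. Below is a minimal sketch under that assumption; the model names, K-factor, base rating, and `comparisons` data are illustrative placeholders, not details from the benchmark itself.

```python
from collections import defaultdict

K = 32          # update step size (standard chess default, assumed here)
BASE = 1000.0   # starting rating assigned to every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, winner: str, loser: str, draw: bool = False) -> None:
    """Apply one pairwise comparison result to the rating table."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)
    score_a = 0.5 if draw else 1.0
    ratings[winner] = ra + K * (score_a - ea)
    ratings[loser] = rb + K * ((1.0 - score_a) - (1.0 - ea))

# Hypothetical per-task verdicts: (task_id, model_a, model_b, winner or None for a draw)
comparisons = [
    ("react-01", "model_x", "model_y", "model_x"),
    ("debug-03", "model_x", "model_y", None),
]

ratings = defaultdict(lambda: BASE)
for task, a, b, winner in comparisons:
    if winner is None:
        update(ratings, a, b, draw=True)
    else:
        loser = b if winner == a else a
        update(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Run over every graded task pair, this produces the kind of leaderboard the post describes, where a cheap model can sit close to an expensive one in rating despite a large cost gap.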
Why It Matters
Gives developers practical performance data for choosing cost-effective coding assistants, grounded in real engineering work rather than synthetic benchmarks.