Qwen 3.5 craters on hard coding tasks: tested all Qwen 3.5 models (and Codex 5.3) on 70 real repos so you don't have to.
A new benchmark pits 25+ models against 70 tasks drawn from real GitHub repos and finds Qwen 3.5 397B dropping 22% on the hardest of them.
The APEX Testing benchmark, built by an independent researcher, has published comprehensive results evaluating over 25 coding LLMs on 70 real-world tasks drawn from actual GitHub repositories. The latest update adds all Qwen 3.5 variants, GPT-5.3 Codex, and several quantized local models, and it evaluates every model through an agentic tool-use harness that lets them explore codebases autonomously, which makes the cloud-versus-local comparison fairer. The most surprising finding: Qwen 3.5 397B drops sharply on the hardest 'master' tasks, falling 22% from its score on easier challenges, while OpenAI's GPT-5.3 Codex stays remarkably consistent across difficulty levels, essentially tying with GPT-5.2.
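The post doesn't publish the harness itself, but an agentic tool-use loop of this kind generally follows the pattern sketched below. The tool names (`list_files`, `read_file`, `run_tests`) and the action format are illustrative assumptions, not APEX's actual interface:

```python
import subprocess
from pathlib import Path

# Hypothetical tool set; the APEX harness itself is not public.
def list_files(repo: Path, subdir: str = ".") -> str:
    """Newline-separated file listing so the model can orient itself."""
    return "\n".join(sorted(str(p.relative_to(repo))
                            for p in (repo / subdir).rglob("*") if p.is_file()))

def read_file(repo: Path, rel_path: str) -> str:
    """Return a file's contents for the model to inspect."""
    return (repo / rel_path).read_text(errors="replace")

def run_tests(repo: Path) -> str:
    """Run the repo's test suite and hand the output back to the model."""
    result = subprocess.run(["pytest", "-q"], cwd=repo,
                            capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"list_files": list_files, "read_file": read_file, "run_tests": run_tests}

def agent_loop(model_step, repo: Path, task: str, max_steps: int = 20) -> str:
    """Drive one task: feed tool output back until the model submits.

    `model_step` is any callable mapping the transcript to the next action,
    e.g. a wrapper around a cloud API or a local llama.cpp server, returning
    either {"tool": name, "args": {...}} or {"submit": final_answer}.
    """
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model_step(transcript)
        if "submit" in action:
            return action["submit"]
        output = TOOLS[action["tool"]](repo, **action["args"])
        transcript.append({"role": "tool", "content": output})
    return ""  # ran out of steps without a submission
```

Keeping the loop model-agnostic is what makes the cloud/local comparison fair: every model sees the same tools and the same step budget regardless of where it runs.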
In the technical breakdown, GLM-4.7's quantized build is the top local model at 1572 ELO, outperforming even the full 397B Qwen 3.5 cloud model. Qwen 3.5 27B proved surprisingly capable for single-GPU use at 1384 ELO, beating DeepSeek V3.2, while the 35B MoE struggled at 1256 ELO, held back by its small active-parameter count. The methodology emphasizes real-world applicability: tasks range from bug fixes to multi-file refactors, and results are scored with pairwise ELO comparisons adjusted for task difficulty. With $3,000 already invested in testing, the project is one of the most comprehensive independent evaluations of coding LLMs available.
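Pairwise ELO from head-to-head comparisons is a standard rating scheme; a minimal update step looks like the sketch below. The `difficulty` multiplier on K is an assumption about what "difficulty adjustments" might mean, since the post doesn't give the formula:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard logistic ELO expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def pairwise_update(r_a: float, r_b: float, outcome: float,
                    k: float = 32.0, difficulty: float = 1.0):
    """Apply one pairwise comparison between two models on one task.

    outcome: 1.0 if A's solution was judged better, 0.0 if B's, 0.5 for a tie.
    difficulty: assumed K multiplier (e.g. >1 on 'master' tasks); the post
    says scores are difficulty-adjusted but not how, so this is a guess.
    """
    delta = k * difficulty * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Example: a 1572-rated model beats a 1384-rated one on a hard task.
new_a, new_b = pairwise_update(1572.0, 1384.0, 1.0, difficulty=1.5)
```

Replaying every judged pair from a common starting rating yields a leaderboard in the same units as the ELO figures quoted above.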
- Qwen 3.5 397B drops 22% to 1194 ELO on 'master' tasks requiring multi-file coordination
- Quantized GLM-4.7 leads local models with 1572 ELO, beating all Qwen 3.5 variants including the cloud 397B
- GPT-5.3 Codex shows remarkable consistency, tying with GPT-5.2 across all difficulty levels
Why It Matters
Provides crucial performance data for developers choosing between expensive cloud APIs and local models for real coding work.