Open Source

Qwen 3.5 craters on hard coding tasks: all Qwen 3.5 models (and GPT-5.3 Codex) tested on 70 real repos so you don't have to.

A new benchmark tests 25+ models on 70 real repos and finds that Qwen 3.5 397B's score drops 22% on the hardest tasks.

Deep Dive

An independent researcher has released comprehensive results from the APEX Testing benchmark, which evaluates over 25 coding LLMs on 70 real-world tasks drawn from actual GitHub repositories. The updated benchmark now covers all Qwen 3.5 variants, GPT-5.3 Codex, and various quantized local models, and uses an agentic tool-use harness that lets models explore codebases autonomously, making the comparison between cloud and local models fairer. The most surprising finding was Qwen 3.5 397B's sharp performance drop on the hardest 'master' tasks, falling 22% from its score on easier challenges, while GPT-5.3 Codex showed remarkable consistency across difficulty levels, essentially tying with GPT-5.2.

The technical breakdown puts a quantized GLM-4.7 at the top of the local models with 1572 ELO, outperforming even the full 397B Qwen 3.5 cloud model. Qwen 3.5 27B proved surprisingly capable for single-GPU use at 1384 ELO, beating DeepSeek V3.2, while the 35B MoE model lagged at 1256 ELO, likely because of its small active-parameter count. The benchmark's methodology emphasizes real-world applicability, with tasks ranging from bug fixes to multi-file refactors scored via pairwise ELO comparisons with difficulty adjustments. With $3,000 already invested in testing, the project is one of the most comprehensive independent evaluations of coding LLMs available, providing crucial data for developers choosing between cloud and local solutions.
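The article does not publish APEX's exact scoring formula, so as a rough illustration of what "pairwise ELO with difficulty adjustments" could mean, here is a minimal sketch: two models' outputs on the same task are compared head-to-head, and the winner gains rating according to the standard Elo expectation, scaled by a hypothetical difficulty multiplier (the function names and the `difficulty` parameter are assumptions, not the benchmark's actual code):

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, outcome, k=32.0, difficulty=1.0):
    """Update both ratings after one pairwise comparison.

    outcome:    1.0 if A's solution is judged better, 0.0 if B's, 0.5 for a tie.
    difficulty: hypothetical multiplier so harder ('master') tasks move
                ratings more than easy ones -- an assumed mechanism.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * difficulty * (outcome - e_a)
    # Zero-sum update: whatever A gains, B loses.
    return r_a + delta, r_b - delta

# Example: a 1572-rated model beats a 1384-rated one on a hard task.
a, b = update_elo(1572, 1384, outcome=1.0, difficulty=1.5)
```

Because the favorite's expected score is already high, an upset by the weaker model on a hard task would shift the ratings far more than this expected win does, which is how such a scheme can separate models that only diverge on 'master'-level tasks.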

Key Points
  • Qwen 3.5 397B drops 22% to 1194 ELO on 'master' tasks requiring multi-file coordination
  • Quantized GLM-4.7 leads local models with 1572 ELO, beating all Qwen 3.5 variants including cloud
  • GPT-5.3 Codex shows remarkable consistency, tying with GPT-5.2 across all difficulty levels

Why It Matters

Provides crucial performance data for developers choosing between expensive cloud APIs and local models for real coding work.