Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, Sonnet 4.6, three new open-source agents, and more added to SanityBoard
Massive benchmark update reveals GPT-codex models' iteration advantage and infrastructure's surprising impact on scores.
SanityBoard, an AI evaluation platform, has added 27 new benchmark results, including evaluations of Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, and Sonnet 4.6. The update shows that GPT-codex models excel at iterative tasks, scoring well on automated coding benchmarks, while Claude models perform better in interactive scenarios. Three new open-source coding agents (kilocode, cline, and pi) were also evaluated, with infrastructure quality across providers significantly affecting performance scores.
Why It Matters
The update gives developers concrete performance data for choosing between GPT-codex models for iterative, automated workflows and Claude models for interactive coding tasks.