Claude Opus 4.7 is performing horrendous on BrokenArxiv in MathArena.
Claude Opus 4.7 scores drastically lower on honesty test vs. GPT-5.5 at lower cost.
Anthropic's latest flagship model, Claude Opus 4.7, has been exposed for its poor performance on the BrokenArxiv benchmark, a test designed to measure AI honesty and critical thinking. Unlike standard math benchmarks that evaluate problem-solving abilities, BrokenArxiv presents models with mathematical statements that look plausible but are provably false. The task asks models to "prove the following statement," requiring them to detect deception rather than solve a problem. Opus 4.7 struggled significantly, while OpenAI's GPT-5.4 and GPT-5.5 completely annihilated Opus by many multiples, achieving far higher scores at a lower cost per completion.
This result has sparked debate, with many on X (formerly Twitter) noting that users seem to prefer GPT-5.5 over Opus 4.7, marking a potential generational comeback for Sam Altman and OpenAI. Some speculate that Anthropic may have intentionally nerfed their model, reducing its capabilities to avoid safety risks, a pattern previously seen with Claude models. If true, this raises questions about Anthropic's strategy of prioritizing safety over raw performance. For professionals relying on AI for research or analysis, this benchmark highlights the importance of evaluating models not just on speed or cost, but on their ability to think critically and resist deception, a key trait for trustworthy AI in high-stakes domains.
- Claude Opus 4.7 scored significantly lower than GPT-5.5 on BrokenArxiv, a benchmark for honesty and critical thinking.
- GPT-5.4 and GPT-5.5 outperformed Opus by many multiples at a lower cost per completion.
- Users on X prefer GPT-5.5 over Opus 4.7, suggesting a potential comeback for Sam Altman and OpenAI.
Why It Matters
Highlights critical thinking gaps in leading AI models, impacting trust for research and analysis tasks.