Claude Opus 4.7 (high) scores 41.0% on the NYT Connections Extended benchmark; Opus 4.6 scored 94.7%.
Claude Opus 4.7 scores just 41% on a complex reasoning benchmark, a massive regression from its predecessor's 94.7%.
A viral benchmark result has revealed a surprising and significant regression in the performance of Anthropic's latest flagship AI model, Claude Opus 4.7. On the NYT Connections Extended benchmark—a test that evaluates an AI's ability to find complex, non-obvious connections between words and group them logically—Opus 4.7 scored a mere 41.0%. That is a steep 53.7-percentage-point drop from the previous version, Opus 4.6, which achieved a near-perfect 94.7% on the same test.
The NYT Connections benchmark is designed to probe advanced reasoning, lateral thinking, and semantic understanding, making it a challenging metric for large language models (LLMs). Such a steep decline on a reasoning-focused task is highly unusual for a successor model from a major lab like Anthropic. It suggests a potential flaw in the new model's training process or fine-tuning, or a fundamental change in its reasoning architecture that has negatively impacted this specific capability.
This incident highlights the non-linear and sometimes unpredictable nature of AI development, where improvements in some areas can inadvertently degrade performance in others. For developers and enterprises relying on Claude Opus for complex analytical tasks, this benchmark serves as a critical reminder to conduct thorough, application-specific testing before upgrading to a new model version. The AI community is now keenly awaiting an official response or analysis from Anthropic regarding this performance anomaly.
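For teams weighing such an upgrade, the practical takeaway is a pre-deployment regression check against a fixed, application-specific eval set. The following is a minimal sketch of that idea, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model IDs are placeholders and the two eval items are illustrative stand-ins, not the actual NYT Connections Extended puzzles.

```python
# Minimal sketch: compare a candidate model version against the current baseline
# on a small fixed eval set before switching production traffic.
# Assumptions: Anthropic Python SDK installed, ANTHROPIC_API_KEY set,
# placeholder model IDs (substitute the versions your account exposes).

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tiny fixed eval set: (prompt, expected substring in the reply). Illustrative only.
EVAL_SET = [
    ("Which word does not belong: apple, banana, carrot, cherry? Answer with one word.", "carrot"),
    ("Group these by category: oak, salmon, maple, trout. Name the two categories.", "tree"),
]

def score(model_id: str) -> float:
    """Return the fraction of eval items whose expected answer appears in the reply."""
    hits = 0
    for prompt, expected in EVAL_SET:
        reply = client.messages.create(
            model=model_id,
            max_tokens=64,
            messages=[{"role": "user", "content": prompt}],
        )
        text = "".join(block.text for block in reply.content if block.type == "text")
        hits += int(expected.lower() in text.lower())
    return hits / len(EVAL_SET)

if __name__ == "__main__":
    # Placeholder IDs: replace with your current and candidate model versions.
    baseline = score("claude-current-model-id")
    candidate = score("claude-candidate-model-id")
    print(f"baseline={baseline:.0%}  candidate={candidate:.0%}")
    if candidate < baseline:
        print("Regression detected: hold the upgrade and investigate before deploying.")
```

In practice the eval set would be drawn from the application's own prompts and graded with whatever metric matters for that workload; the point is simply to gate the version switch on a like-for-like comparison rather than on headline benchmark numbers.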
- Claude Opus 4.7 scored 41.0% on the NYT Connections Extended reasoning benchmark.
- This is a 53.7-point drop from Opus 4.6's score of 94.7% on the same test.
- The result suggests a potential major regression in the new model's complex logical reasoning capabilities.
Why It Matters
Highlights the unpredictability of AI model updates, forcing businesses to rigorously test new versions before deployment to avoid performance regressions in production.