Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8.
New flagship model scores 72.8 on thematic generalization test, down from the previous version's 80.6.
Anthropic's newly released Claude Opus 4.7, the latest iteration of its flagship AI model, has demonstrated a surprising performance regression on a specialized reasoning benchmark. According to a widely shared analysis on GitHub, Opus 4.7 scored 72.8 on the Thematic Generalization Benchmark, down from Opus 4.6's 80.6. This 7.8-point drop (roughly 10% in relative terms) is particularly notable because the benchmark tests advanced cognitive abilities: the model must infer a latent theme from a few examples, use anti-examples to reject broader but incorrect patterns, and then identify the single correct match among close distractors.
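The benchmark's actual item format and scoring script aren't reproduced in the analysis, but the task structure described above maps naturally onto a simple evaluation loop. The sketch below is illustrative only: the field names, example strings, and 0-100 scoring convention are assumptions, not the benchmark's own code.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One benchmark item: infer the latent theme, then pick the match."""
    examples: list[str]       # all share the hidden theme
    anti_examples: list[str]  # fit a broader pattern, but NOT the theme
    candidates: list[str]     # close distractors plus one true match
    answer_index: int         # index of the single correct candidate

def score(items: list[Item], pick) -> float:
    """Fraction of items (as a 0-100 score) where the model's pick
    matches the answer. `pick` is any callable mapping an Item to a
    candidate index."""
    correct = sum(pick(item) == item.answer_index for item in items)
    return 100.0 * correct / len(items)

# Hypothetical item, modeled on the failure discussed below:
item = Item(
    examples=["Torah scroll on parchment", "Quran on vellum"],
    anti_examples=[
        "medieval land deed on parchment",  # animal skin, not religious
        "paperback Bible",                  # religious, not animal skin
    ],
    candidates=["illuminated psalter on vellum", "papyrus hymn fragment"],
    answer_index=0,
)
```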
A concrete example illustrates the failure. When the theme was 'religious texts written on animal skin,' Opus 4.6 correctly identified the conjunction of both attributes. Opus 4.7, however, dropped the material constraint and behaved as if 'religious manuscript' alone were sufficient, despite anti-examples explicitly distinguishing 'animal skin but not religious' from 'religious but not animal skin.' The regression is more severe in the model's 'no reasoning' mode, where Opus 4.7 scored 52.6 against Opus 4.6's 68.8. The new 'xhigh' reasoning setting does not appear to help either: it produces substantially longer outputs (averaging 1,121 tokens) without better accuracy.
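Viewed as hypothesis elimination, the anti-examples give the model everything it needs to reject the broader pattern. The toy sketch below makes that logic explicit; the feature flags and predicates are invented for illustration, not drawn from the benchmark. The conjunctive hypothesis survives the anti-examples, while the broader one does not.

```python
# Two competing hypotheses about the latent theme.
def is_religious(x: dict) -> bool:
    return x["religious"]

def is_religious_on_skin(x: dict) -> bool:
    return x["religious"] and x["animal_skin"]

def consistent(hypothesis, examples, anti_examples) -> bool:
    """A hypothesis must accept every example and reject every anti-example."""
    return (all(hypothesis(e) for e in examples)
            and not any(hypothesis(a) for a in anti_examples))

examples = [
    {"religious": True, "animal_skin": True},   # Torah scroll on parchment
]
anti_examples = [
    {"religious": True, "animal_skin": False},  # paperback Bible
    {"religious": False, "animal_skin": True},  # land deed on vellum
]

# The broad hypothesis accepts the 'paperback Bible' anti-example,
# so the anti-examples alone suffice to rule it out:
assert not consistent(is_religious, examples, anti_examples)
assert consistent(is_religious_on_skin, examples, anti_examples)
```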
This incident underscores the non-linear nature of AI model development, where newer versions can regress on specific capabilities even as they advance on others. For developers and enterprises relying on Claude for complex reasoning tasks, this benchmark result serves as a critical reminder to thoroughly evaluate new model versions against their specific use cases rather than assuming monotonic improvement. The community's rapid identification and analysis of this issue also highlight the growing sophistication of open benchmarking in the AI ecosystem.
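A lightweight version-to-version regression check is straightforward to wire up. The sketch below is a generic harness, not the benchmark's code: `call_model` is a placeholder for whatever API client you use, and the eval-set format and threshold are assumptions to adapt to your own workload.

```python
from typing import Callable

def regression_check(
    eval_set: list[dict],
    call_model: Callable[[str, str], str],  # (model_id, prompt) -> answer
    old_model: str,
    new_model: str,
    threshold: float = 2.0,  # points of accuracy you are willing to lose
) -> None:
    """Compare two model versions on a fixed eval set before upgrading."""
    def accuracy(model_id: str) -> float:
        correct = sum(
            call_model(model_id, case["prompt"]).strip() == case["answer"]
            for case in eval_set
        )
        return 100.0 * correct / len(eval_set)

    old_score, new_score = accuracy(old_model), accuracy(new_model)
    print(f"{old_model}: {old_score:.1f}  {new_model}: {new_score:.1f}")
    if old_score - new_score > threshold:
        raise RuntimeError(
            f"Regression: {new_model} is {old_score - new_score:.1f} "
            "points worse on this eval set; do not upgrade blindly."
        )
```

Pinning a small, task-specific eval set like this in CI catches exactly the kind of capability-specific regression reported here before it reaches production.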
- Claude Opus 4.7 scores 72.8 on the Thematic Generalization Benchmark, a 7.8-point (roughly 10%) drop from Opus 4.6's 80.6.
- The model drops conjunctive constraints, e.g., accepting 'religious manuscript' alone where the theme was the more precise 'religious texts written on animal skin.'
- The 'no reasoning' mode regressed more severely, scoring 52.6 vs. the previous version's 68.8.
Why It Matters
Highlights that newer AI models aren't always better across the board, and that professionals must validate each update against their own critical reasoning workloads before upgrading.