Difference Between Sonnet 4.5 and Sonnet 4.6 on a Spatial Reasoning Benchmark (MineBench)
A new benchmark run, which cost roughly $80, shows a slight but measurable performance improvement for Sonnet 4.6.
Deep Dive
Independent developer Ammaar Alam benchmarked Anthropic's Claude Sonnet 4.6 against its predecessor, Sonnet 4.5, using MineBench, a custom spatial reasoning test. Both models ran with the beta 1M-token context window and maximum thinking effort. The run, which cost roughly $80, showed Sonnet 4.6 performing slightly better, though the developer notes that frequent JSON validation errors from Anthropic's models may have affected the results.
Why It Matters
For developers, even incremental model improvements can affect performance on complex tasks such as 3D reasoning and code generation.